### Opening a File<br>
* When we want to write a program that opens and reads a file, that program needs to tell Python where that file is
* By default, Python assumes that the file we want to read is in the same directory as the program that is doing the reading

In [None]:
file = open('file_example.txt', 'r')
contents = file.read()
print(contents)
file.close()

* For this example to work, the program and file_example.txt must be saved in the same directory
* Built-in function $\rm\color{orange}{open}$ opens a file (much like we open a book when we want to read it) and returns an object that knows how to get information from the file, how much we have read, and which part of the file we are about to read next
* The marker that keeps track of the current location in the file is called a $\rm\color{magenta}{file\space cursor}$ and acts much like a bookmark
* The file cursor is initially at the beginning of the file, but as we read or write data it moves to the end of what we just read or wrote

* The first argument in the example call on function open, 'file_example.txt', is the name of the file to open, and the second argument, '$\rm\color{orange}{r}$', tells Python that we want to read the file; this is called the file $\rm\color{magenta}{mode}$
    * Other options for the mode include '$\rm\color{orange}{w}$' for writing and '$\rm\color{orange}{a}$' for appending, which we will see later in this lecture
* If we call open with only the name of the file (omitting the mode), then the default is 'r'
* The second statement, $contents=file.read()$, tells Python that we want to read the contents of the entire file into a string, which we assign to a variable called $\rm\color{magenta}{contents}$
* The third statement prints that string
* When we run the program, we will see that newline characters are treated just like every other character; a newline character is just another character in the file
* The last statement, $file.close()$, releases all resources associated with the open file object

### The with Statement<br>
* Because every call on function open should have a corresponding call on method close, Python provides a $\rm\color{orange}{with}$ statement that automatically closes a file when the end of the block is reached
* Here is the same example using a with statement:
  
        with open('file_example.txt', 'r') as file:
            contents = file.read()
        
        print(contents)
  
* The general form of a with statement is as follows:
  
        with open(«filename», «mode») as «variable»:
            «block»
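
* As a minimal sketch of what "automatically closes" means, the following code checks attribute closed of the file object inside and after the with block
* The scratch file created with module tempfile, and the names path, closed_inside, and closed_after, are assumptions made only for this illustration

```python
import os
import tempfile

# Create a scratch file so the sketch does not depend on file_example.txt
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')
with open(path, 'w') as scratch:
    scratch.write('First line\nSecond line\n')

with open(path, 'r') as file:
    contents = file.read()
    closed_inside = file.closed   # the file is still open inside the block

closed_after = file.closed        # the file is closed once the block ends

print(closed_inside)
print(closed_after)
```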

### Techniques for Reading Files<br>
* Python provides several techniques for reading files
* All of these techniques work starting at the current file cursor. That allows us to combine the techniques as we need to
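* As a rough sketch of combining techniques, the following code reads one line, then three characters, and then loops over whatever is left; each step continues from where the previous one stopped
* The scratch file planets_demo.txt and the variable names are assumptions made for this illustration

```python
import os
import tempfile

# Build a small scratch file to read from
path = os.path.join(tempfile.mkdtemp(), 'planets_demo.txt')
with open(path, 'w') as out:
    out.write('Mercury\nVenus\nEarth\nMars\n')

with open(path, 'r') as planets_file:
    # Read the first line; the cursor is now at the start of line 2
    first_line = planets_file.readline()
    # Read the next three characters; the cursor is now inside line 2
    next_three = planets_file.read(3)
    # Loop over what is left, starting from the current cursor
    remaining = []
    for line in planets_file:
        remaining.append(line)

print(repr(first_line))
print(repr(next_three))
print(remaining)
```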

### The Read Technique<br>
* Use this technique when we want to read the contents of a file into a single string, or when we want to specify exactly how many characters to read

In [None]:
with open('file_example.txt', 'r') as file:
    contents = file.read()

print(contents)

* When called with no arguments, it reads everything from the current file cursor all the way to the end of the file and moves the file cursor to the end of the file
* When called with one integer argument, it reads that many characters and moves the file cursor after the characters that were just read
* Here is a version of the same program that reads ten characters and then the rest of the file
    * Method call example_file.read(10) moves the file cursor
    * The next call, example_file.read(), reads everything from character 11 to the end of the file

In [None]:
with open('file_example.txt', 'r') as example_file:
    first_ten_chars = example_file.read(10)
    the_rest = example_file.read()
    
print(f'The first 10 characters: {first_ten_chars}')
print(f'The rest of the file: {the_rest}')

### Reading at the End of a File<br>
* When the file cursor is at the end of the file, methods read and readline return an empty string, and method readlines returns an empty list
* To read the contents of a file a second time, the simplest approach is to close and reopen the file
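* The following sketch shows what each method returns once the cursor has reached the end of the file
* It also uses file method seek(0), which this lecture does not otherwise cover, to move the cursor back to the beginning without reopening the file; the scratch file and variable names are assumptions made for this illustration

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'end_demo.txt')
with open(path, 'w') as out:
    out.write('Mercury\nVenus\n')

with open(path, 'r') as demo_file:
    first_pass = demo_file.read()              # moves the cursor to the end
    at_end_read = demo_file.read()             # nothing left to read
    at_end_readline = demo_file.readline()     # nothing left to read
    at_end_readlines = demo_file.readlines()   # nothing left to read
    demo_file.seek(0)                          # back to the beginning
    second_pass = demo_file.read()

print(repr(at_end_read))
print(repr(at_end_readline))
print(at_end_readlines)
print(first_pass == second_pass)
```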

### The Readlines Technique<br>
* Use this technique when we want to get a Python list of strings containing the individual lines from a file
* Function $\rm\color{orange}{readlines}$ works much like function $\rm\color{orange}{read}$, except that it splits up the lines into a list of strings
* As with $\rm\color{orange}{read}$, the file cursor is moved to the end of the file

In [None]:
with open('file_example.txt', 'r') as example_file:
    lines = example_file.readlines()

print(lines)

* Take a close look at that list; each string ends with a newline character, $\rm{\backslash n}$
* Python does not remove any characters from what is read; it only splits them into separate strings
* The last line of a file may or may not end with a newline character
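* To illustrate the last point, this sketch writes a scratch file whose final line has no newline and then reads it back with readlines
* The scratch file and the names lines and cleaned are assumptions made for this illustration; rstrip with the argument '\n' removes a trailing newline only if one is present

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'no_final_newline.txt')
with open(path, 'w') as out:
    # Note: no '\n' after Mars
    out.write('Mercury\nVenus\nEarth\nMars')

with open(path, 'r') as planets_file:
    lines = planets_file.readlines()

# Only the last string lacks a newline
print(lines)

# rstrip('\n') is safe for both kinds of line
cleaned = [line.rstrip('\n') for line in lines]
print(cleaned)
```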

* The file planets.txt contains the following text
  
        Mercury
        Venus
        Earth
        Mars
  
* The following example prints the lines in planets.txt backward, from the last line to the first
* Here, we use built-in function $\rm\color{orange}{reversed}$, which returns the items in the list in reverse order

In [None]:
with open('planets.txt', 'r') as planets_file:
    planets = planets_file.readlines()

print(planets)

for planet in reversed(planets):
    print(planet.strip())

* We can use the $\rm\color{orange}{readlines}$ technique to read the file, sort the lines, and print the planets alphabetically
* Here, we use built-in function $\rm\color{orange}{sorted}$, which returns the items in the list in order from smallest to largest

In [None]:
with open('planets.txt', 'r') as planets_file:
    planets = planets_file.readlines()
    
print(planets)

for planet in sorted(planets):
    print(planet.strip())

### The "For Line in File" Technique<br>
* Use this technique when we want to do the same thing to every line from the file cursor to the end of a file
* On each iteration, the file cursor is moved to the beginning of the next line
* The following code opens file planets.txt and prints the length of each line in that file

In [None]:
with open('planets.txt', 'r') as data_file:
    for line in data_file:
        print(len(line))

* Take a close look at the last line of output. There are only four characters in the word Mars, but our program is reporting that the line is five characters long
* The reason for this is the same as for function readlines: Each of the lines we read from the file has a $\rm\color{magenta}{newline}$ character at the end
* We can get rid of it using string method $\rm\color{orange}{strip}$, which returns a copy of a string that has leading and trailing whitespace characters (spaces, tabs, and newlines) stripped away

In [None]:
with open('planets.txt', 'r') as data_file:
    for line in data_file:
        print(len(line.strip()))

### The Readline Technique<br>
* This technique reads one line at a time, unlike the Readlines technique
* Use this technique when we want to read only part of a file
* For example, we might want to treat lines differently depending on context; perhaps we want to process a file that has a header section followed by a series of records, either one record per line or with multiline records
* The following data, taken from the $\rm\color{cyan}{Time\space Series\space Data\space Library}$, describes the number of colored fox fur pelts produced in Hopedale, Labrador, in the years 1834–1842 (The full data set has values for the years 1834–1925.)

In [None]:
with open('hopedale.txt', 'r') as hopedale_file:
    # Read the description line
    hopedale_file.readline()

    # Keep reading comment lines until we read the first piece of data
    data = hopedale_file.readline().strip()
    while data.startswith('#'):
        data = hopedale_file.readline().strip()

    # Now we have the first piece of data. Accumulate the total number of pelts
    total_pelts = int(data)

    # Read the rest of the data
    for data in hopedale_file:
        total_pelts = total_pelts + int(data.strip())
    
print(f'Total number of pelts: {total_pelts}')

* Sometimes leading whitespace is important and we will want to preserve it
* In the Hopedale data, for example, the integers are right-justified to make them line up nicely
* In order to preserve this, you can use $\rm\color{orange}{rstrip}$ instead of strip to remove the trailing newline

In [None]:
with open('hopedale.txt', 'r') as hopedale_file:
    # Read the description line
    hopedale_file.readline()

    # Keep reading comment lines until we read the first piece of data
    data = hopedale_file.readline().rstrip()
    while data.startswith('#'):
        data = hopedale_file.readline().rstrip()

    # Now we have the first piece of data
    print(data)
    
    # Read the rest of the data
    for data in hopedale_file:
        print(data.rstrip())
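
* The difference between the two methods is easiest to see on a single right-justified line; the sample string below is made up for this sketch

```python
# A right-justified value followed by a newline, as it might appear in a data file
line = '    55\n'

both_sides = line.strip()    # removes leading spaces and the trailing newline
right_only = line.rstrip()   # removes only the trailing newline

# repr makes the remaining whitespace visible
print(repr(both_sides))
print(repr(right_only))
```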

### Files over the Internet<br>
* These days, of course, the file containing the data we want could be on a machine half a world away
* Provided the file is accessible over the Internet, though, we can read it just as we do a local file
* For example, the Hopedale data not only exists on our computers, but it is also on a web page
* The URL for the file is [here](http://robjhyndman.com/tsdldata/ecology1/hopedale.dat)

* Module $\rm\color{orange}{urllib.request}$ contains a function called $\rm\color{orange}{urlopen}$ that opens a web page for reading
* $\rm\color{orange}{urlopen}$ returns a file-like object that we can use much as if we were reading a local file
* There is a hitch: Because there are many kinds of files (images, music, videos, text, and more), the file-like object’s read and readline methods both return a type we have not yet encountered: bytes

### What Is a Byte?<br>
* To a computer, information is nothing but bits, which we think of as ones and zeros
* All data—for example, characters, sounds, and pixels—are represented as sequences of bits
* Computers organize these bits into groups of $\rm\color{magenta}{eight}$. Each group of eight bits is called a $\rm\color{magenta}{byte}$
* Programming languages interpret these bytes for us and let us think of them as integers, strings, functions, and documents

* When dealing with type bytes, such as the data returned by method read of the object that urllib.request.urlopen returns, we need to $\rm\color{magenta}{decode}$ it
* In order to decode it, we need to know how it was $\rm\color{magenta}{encoded}$
* Common encoding schemes are described in the online Python documentation [here](http://docs.python.org/3/library/codecs.html#standard-encodings)
* The Hopedale data on the Web is encoded using $\rm\color{magenta}{UTF-8}$
* This program reads that web page and uses bytes method decode in order to decode each bytes object

In [None]:
import urllib.request
url = 'http://robjhyndman.com/tsdldata/ecology1/hopedale.dat'
with urllib.request.urlopen(url) as webpage:
    for line in webpage:
        line = line.strip()
        line = line.decode('utf-8')
        print(line)
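
* Decoding can also be tried without a network connection; this sketch builds a bytes object with string method encode and then decodes it again
* The sample text is made up, and it includes one non-ASCII character to show that the number of bytes can differ from the number of characters

```python
# Sample text made up for this sketch; \u00e9 is the character e with an acute accent
raw_line = 'caf\u00e9: 55 pelts\n'.encode('utf-8')

print(type(raw_line))   # a bytes object, like the lines read from a web page
print(raw_line)

# bytes objects have their own strip method; decode turns the result into a str
decoded = raw_line.strip().decode('utf-8')
print(decoded)

# The accented character takes two bytes in UTF-8 but is one character
byte_count = len(raw_line.strip())
char_count = len(decoded)
print(byte_count, char_count)
```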

### Writing Files<br>
* This program opens a file called topics.txt, writes the words $\rm\color{cyan}{Computer\space Science}$ to the file, and then closes the file

In [None]:
with open('topics.txt', 'w') as output_file:
    output_file.write('Computer Science')

* In addition to writing characters to a file, method write $\rm\color{cyan}{returns}$ the number of characters written. For example, output_file.write('Computer Science') returns $16$
* To create a new file or to replace the contents of an existing file, we use write mode ('$\rm\color{orange}{w}$'). If the filename does not exist already, then a new file is created; otherwise the file contents are $\rm\color{magenta}{erased\space and\space replaced}$
* Once opened for writing, we can use method write to write a string to the file
* Rather than replacing the file contents, we can also add to a file using the append mode ('$\rm\color{orange}{a}$')
* When we write to a file that is opened in append mode, the data we write is added to the end of the file and the current file contents are not overwritten

In [None]:
with open('topics.txt', 'a') as output_file:
    output_file.write('Software Engineering')

* At this point, if we print the contents of topics.txt, we would see the following
  
        Computer ScienceSoftware Engineering
  
* Unlike function print, method write does not automatically start a new line; if we want a string to end in a newline, we have to include it manually using '$\rm{\backslash n}$'
* In each of the previous examples, we called write only once, but we will typically call it multiple times
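* A short sketch of calling write several times, adding the newline by hand each time; the scratch file and the names topics and counts are assumptions made for this illustration

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'topics_demo.txt')
topics = ['Computer Science', 'Software Engineering']

counts = []
with open(path, 'w') as output_file:
    for topic in topics:
        # write does not add a newline, so we append one ourselves
        counts.append(output_file.write(topic + '\n'))

with open(path, 'r') as input_file:
    contents = input_file.read()

print(counts)          # characters written by each call, newline included
print(repr(contents))
```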

* The next example is more complex, and it involves both reading from and writing to a file
* Our input file contains two numbers per line separated by a space
* Each line of the output file will contain three numbers: the two from the corresponding input line and their sum (all separated by spaces)

In [None]:
def sum_number_pairs(input_file, output_filename):
    """ (file open for reading, str) -> NoneType
    Read the data from input_file, which contains two floats per line
    separated by a space. Open file named output_filename and, for each line in
    input_file, write a line to the output file that contains the two floats
    from the corresponding line of input_file plus a space and the sum of the
    two floats.
    """

    with open(output_filename, 'w') as output_file:
        for number_pair in input_file:
            number_pair = number_pair.strip()
            operands = number_pair.split()
            total = float(operands[0]) + float(operands[1])
            new_line = f'{number_pair} {total}\n'
            output_file.write(new_line)
            
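# Open the input file in a with statement so that it is closed afterward
with open('number_pairs.txt', 'r') as input_file:
    sum_number_pairs(input_file, 'out.txt')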

### Writing Algorithms That Use the File-Reading Techniques<br>
* There are several common ways to organize information in files
* The rest of this lecture will show how to apply the various file-reading techniques to these situations and how to develop some algorithms to help with this

### Skipping the Header<br>
* Many data files begin with a $\rm\color{magenta}{header}$
* As described earlier, TSDL files begin with a one-line description followed by comments in lines beginning with a $\rm\color{orange}{\#}$, and the Readline technique can be used to skip that header
* The technique ends when we read the first real piece of data, which will be the first line after the description that does not start with a $\rm\color{orange}{\#}$

* In English, we might try this $\rm\color{cyan}{algorithm}$ to process this kind of a file
  
        Skip the first line in the file
        Skip over the comment lines in the file
        For each of the remaining lines in the file:
            Process the data on that line
  
* The problem with this approach is that we cannot tell whether a line is a comment line until we have read it, but we can read a line from a file only once—there is no simple way to "back up" in the file
* An alternative approach is to read the line, skip it if it is a comment, and process it if it is not
* Once we have processed the first line of data, we process the remaining lines
  
        Skip the first line in the file
        Find and process the first line of data in the file
        For each of the remaining lines:
            Process the data on that line
  
* The thing to notice about this algorithm is that it processes lines in two places
    * Once when it finds the first "interesting" line in the file and
    * Once when it handles all of the following lines

In [None]:
def skip_header(reader):
    """ (file open for reading) -> str
    Skip the header in reader and return the first real piece of data.
    """

    # Read the description line
    line = reader.readline()
    # Find the first non-comment line
    line = reader.readline()
    while line.startswith('#'):
        line = reader.readline()
    # Now line contains the first real piece of data
    
    return line

def process_file(reader):
    """ (file open for reading) -> NoneType
    Read and print the data from reader, which must start with a single
    description line, then a sequence of lines beginning with '#', then a
    sequence of data.
    """
    
    # Find and print the first piece of data
    line = skip_header(reader).strip()
    print(line)
    # Read the rest of the data
    for line in reader:
        line = line.strip()
        print(line)

with open('hopedale.txt', 'r') as input_file:
    process_file(input_file)

* In $\rm\color{cyan}{skip\_header}$, we return the first line of real data, because once we have found it, we cannot read it again (we can go forward but not backward)
* We will want to use $\rm\color{cyan}{skip\_header}$ in all of the file-processing functions in this section
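* One way to try skip_header without a data file is to pass it an io.StringIO object, which behaves like a file open for reading
* This is only a sketch: the use of io.StringIO and the sample header and values are assumptions made for this illustration, and skip_header is repeated so that the block runs on its own

```python
import io

def skip_header(reader):
    """ (file open for reading) -> str
    Skip the header in reader and return the first real piece of data.
    """
    # Read the description line
    reader.readline()
    # Find the first non-comment line
    line = reader.readline()
    while line.startswith('#'):
        line = reader.readline()
    return line

# Made-up data in the same layout as a TSDL file
sample = io.StringIO(
    'Example description line\n'
    '#first comment\n'
    '#second comment\n'
    '    10\n'
    '    20\n'
)

first = skip_header(sample)
rest = sample.readlines()   # the cursor is now just past the first data line

print(repr(first))
print(rest)
```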

* This program processes the Hopedale data set to find the smallest number of fox pelts produced in any year
* As we progress through the file, we keep the smallest value seen so far in a variable called smallest
* That variable is initially set to the value on the first line, since it is the smallest (and only) value seen so far

In [None]:
def smallest_value(reader):
    """ (file open for reading) -> int
    Read and process reader and return the smallest value after the
    time_series header.
    """
    
    line = skip_header(reader).strip()
    # Now line contains the first data value; this is also the smallest value
    # found so far, because it is the only one we have seen
    smallest = int(line)
    for line in reader:
        value = int(line.strip())
        # If we find a smaller value, remember it
        if value < smallest:
            smallest = value

        # We can replace the if statement with this single line
        # smallest = min(smallest, value)

    return smallest

with open('hopedale.txt', 'r') as input_file:
    print(smallest_value(input_file))

### Dealing with Missing Values in Data<br>
* We also have data for colored fox production in Hebron, Labrador
  
        Coloured fox fur production, Hebron, Labrador, 1834-1839
        #Source: C. Elton (1942) "Voles, Mice and Lemmings", Oxford Univ. Press
        #Table 17, p.265--266
        #remark: missing value for 1836
            55
           262
           -
           102
           178
           227
  
* The hyphen indicates that data for the year 1836 is missing
* Unfortunately, calling smallest_value on the Hebron data produces this error

In [None]:
smallest_value(open('hebron.txt', 'r'))

* The problem is that '-' is not an integer, so calling int('-') fails
* This is not an isolated problem. In general, we will often need to skip blank lines, comments, or lines containing other "nonvalues" in our data
* Real data sets often contain omissions or contradictions; dealing with them is just a fact of scientific life
* To fix our code, we must add a check inside the loop that processes a line only if it contains a real value
* In the TSDL data sets, missing entries are always marked with hyphens, so we just need to check for that before trying to convert the string we have read to an integer
* Notice that the update to smallest is nested inside the check for hyphens

In [None]:
def smallest_value_skip(reader):
    """ (file open for reading) -> int
    Read and process reader, which must start with a time_series header.
    Return the smallest value after the header. Skip missing values, which
    are indicated with a hyphen.
    """

    line = skip_header(reader).strip()
    # Now line contains the first data value; this is also the smallest value
    # found so far, because it is the only one we have seen
    smallest = int(line)
    for line in reader:
        line = line.strip()
        if line != '-':
            value = int(line)
            smallest = min(smallest, value)

    return smallest

with open('hebron.txt', 'r') as input_file:
    print(smallest_value_skip(input_file))
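
* The function above still assumes that the first data line holds a real value; if the first entry were a hyphen, the call on int would fail
* The following is one possible variation, not the approach used above: it handles every line in a single loop and starts with smallest set to None
* The name smallest_value_tolerant, the use of io.StringIO, and the sample data are assumptions made for this sketch

```python
import io

def smallest_value_tolerant(reader):
    """ (file open for reading) -> int or NoneType
    Return the smallest value after the header in reader, skipping comment
    lines, blank lines, and hyphens. Return None if there are no values.
    """
    # Skip the description line
    reader.readline()

    smallest = None
    for line in reader:
        line = line.strip()
        # Ignore anything that is not a real value
        if not line or line.startswith('#') or line == '-':
            continue
        value = int(line)
        if smallest is None or value < smallest:
            smallest = value

    return smallest

# Made-up data whose first entry is missing
sample = io.StringIO(
    'Example description line\n'
    '#remark: first value missing\n'
    '   -\n'
    '   262\n'
    '   102\n'
)
result = smallest_value_tolerant(sample)
print(result)
```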

### Processing Whitespace-Delimited Data<br>
* The file available [here](http://robjhyndman.com/tsdldata/ecology1/lynx.dat) (Time Series Data Library [Hyn06]) contains information about lynx pelts in the years 1821–1934
* All data values are integers, each line contains many values, the values are separated by whitespace, and for reasons best known to the file’s author, each value ends with a period

* To process this, we will break each line into pieces and strip off the periods
* Our algorithm is the same as it was for the fox pelt data: Find and process the first line of data in the file, and then process each of the subsequent lines
* However, the notion of "processing a line" needs to be examined further because there are many values per line
* Our refined algorithm, shown next, uses nested loops to handle the notion of "for each line and for each value on that line"
  
        Find the first line containing real data after the header
        For each piece of data in the current line:
            Process that piece
        For each of the remaining lines of data:
            For each piece of data in the current line:
                Process that piece
  
* Once again we are processing lines in two different places
* That is a strong hint that we should write a helper function to avoid duplicate code
* Rewriting our algorithm to make it specific to the problem of finding the largest value makes this clearer
  
        Find the first line of real data after the header
        Find the largest value in that line
        
        For each of the remaining lines of data:
            Find the largest value in that line
            If that value is larger than the previous largest, remember it

* The helper function required is one that finds the largest value in a line, and it must split up the line
* String method $\rm\color{orange}{split}$ will split around the whitespace, but we still have to remove the periods at the ends of the values
* We can also simplify our code by initializing largest to $-1$, because that value is guaranteed to be smaller than any of the (positive) values in the file
* That way, no matter what the first real value is, it will be larger than the "previous" value (our $-1$) and replace it

In [None]:
def find_largest(line):
    """ (str) -> int
    Return the largest value in line, which is a whitespace-delimited string
    of integers that each end with a '.'.
    >>> find_largest('1. 3. 2. 5. 2.')
    5
    """

    # The largest value seen so far
    largest = -1
    for value in line.split():
        # Remove the trailing period
        v = int(value[:-1])
        # If we find a larger value, remember it
        if v > largest:
            largest = v

    return largest

def process_file(reader):
    """ (file open for reading) -> int
    Read and process reader, which must start with a time_series header.
    Return the largest value after the header. There may be multiple pieces
    of data on each line.
    """

    line = skip_header(reader).strip()
    # The largest value so far is the largest on this first line of data.
    largest = find_largest(line)
    # Check the rest of the lines for larger values.
    for line in reader:
        large = find_largest(line)
        if large > largest:
            largest = large
            
    return largest

with open('lynx.txt', 'r') as input_file:
    print(process_file(input_file))
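
* The explicit loop in find_largest can also be written with built-in function max; this is an alternative sketch, not a replacement for the version above
* The name find_largest_max is hypothetical; the default of -1 plays the same role as the initial value of largest, and rstrip('.') removes the trailing period

```python
def find_largest_max(line):
    """ (str) -> int
    Return the largest value in line, which is a whitespace-delimited string
    of integers that each end with a '.'. Return -1 if line has no values.
    """
    # max consumes the generator; default is used when the line is empty
    return max((int(value.rstrip('.')) for value in line.split()), default=-1)

largest_in_sample = find_largest_max('1. 3. 2. 5. 2.')
largest_in_empty = find_largest_max('')

print(largest_in_sample)
print(largest_in_empty)
```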

### Multiline Records<br>
* Not every data record will fit onto a single line
* $\rm\color{cyan}{multimol\_simple.pdb}$ is a file in simplified Protein Data Bank (PDB) format that describes the arrangements of atoms in ammonia
* The first line is the name of the molecule. All subsequent lines down to the one containing END specify the ID, type, and XYZ coordinates of one of the atoms in the molecule
* Reading this file is straightforward using the techniques we have built up in this lecture
* But what if the file contained two or more molecules, like $\rm\color{cyan}{multimol.pdb}$?

* As always, we tackle this problem by dividing into smaller ones and solving each of those in turn
* Our first algorithm is as follows
  
        While there are more molecules in the file:
            Read a molecule from the file
            Append it to the list of molecules read so far
  
* Simple, except the only way to tell whether there is another molecule left in the file is to try to read it
* Our modified algorithm is as follows
  
        reading = True
        while reading:
            Try to read a molecule from the file
            If there is one:
                Append it to the list of molecules read so far
            else: # nothing left
                reading = False

In [None]:
def read_molecule(reader):
    """ (file open for reading) -> list or NoneType
    Read a single molecule from reader and return it, or return None to signal
    end of file. The first item in the result is the name of the compound;
    each remaining item is a list containing an atom type and the X, Y, and Z
    coordinates of that atom.
    """

    # If there isn't another line, we're at the end of the file
    line = reader.readline()
    if not line:
        return None
    # Name of the molecule: "COMPND name"
    key, name = line.split()
    # Other lines are either "END" or "ATOM num atom_type x y z"
    molecule = [name]
    line = reader.readline()
    # Parse all the atoms in the molecule.
    while not line.startswith('END'):
        key, num, atom_type, x, y, z = line.split()
        molecule.append([atom_type, x, y, z])
        line = reader.readline()

    return molecule

def read_all_molecules(reader):
    """ (file open for reading) -> list
    Read zero or more molecules from reader, returning a list of the molecule
    information.
    """
    
    # The list of molecule information.
    result = []
    reading = True
    while reading:
        molecule = read_molecule(reader)
        if molecule: # None is treated as False in an if statement
            result.append(molecule)
        else:
            reading = False

    return result

with open('multimol.pdb', 'r') as molecule_file:
    # Read the first molecule on its own, then all of the remaining ones
    molecule = read_molecule(molecule_file)
    molecules = read_all_molecules(molecule_file)

print(molecule)
print()
print(molecules)

* The work of actually reading a single molecule has been put in a function of its own that must return some false value (such as None) if it cannot find another molecule in the file
* This function checks the first line it tries to read to see whether there is actually any data left in the file
    * If not, it returns immediately to tell read_all_molecules that the end of the file has been reached
    * Otherwise, it pulls the name of the molecule out of the first line and then reads the molecule’s atoms one at a time down to the END line
* Notice that the version of read_molecule below uses exactly the same trick (a reading flag) to spot the END line that marks the end of a single molecule as read_all_molecules uses to spot the end of the file

In [None]:
def read_molecule(reader):
    """ (file open for reading) -> list or NoneType
    Read a single molecule from reader and return it, or return None to signal
    end of file. The first item in the result is the name of the compound;
    each remaining item is a list containing an atom type and the X, Y, and Z
    coordinates of that atom.
    """

    # If there isn't another line, we're at the end of the file.
    line = reader.readline()
    if not line:
        return None
    # Name of the molecule: "COMPND name"
    key, name = line.split()
    # Other lines are either "END" or "ATOM num atom_type x y z"
    molecule = [name]
    reading = True
    while reading:
        line = reader.readline()
        if line.startswith('END'):
            reading = False
        else:
            key, num, atom_type, x, y, z = line.split()
            molecule.append([atom_type, x, y, z])
    return molecule
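
* To see the shape of the result without the PDB files, we can feed made-up records to these functions through an io.StringIO object
* This is a sketch under stated assumptions: the molecule names and coordinates below are invented, io.StringIO stands in for a real file, and read_all_molecules is written with a plain while loop rather than the reading flag; read_molecule repeats the first version so that the block runs on its own

```python
import io

def read_molecule(reader):
    """ (file open for reading) -> list or NoneType
    Read a single molecule from reader and return it, or return None to signal
    end of file.
    """
    line = reader.readline()
    if not line:
        return None
    # Name of the molecule: "COMPND name"
    key, name = line.split()
    molecule = [name]
    line = reader.readline()
    # Other lines are either "END" or "ATOM num atom_type x y z"
    while not line.startswith('END'):
        key, num, atom_type, x, y, z = line.split()
        molecule.append([atom_type, x, y, z])
        line = reader.readline()
    return molecule

def read_all_molecules(reader):
    """ (file open for reading) -> list
    Read zero or more molecules from reader and return them in a list.
    """
    result = []
    molecule = read_molecule(reader)
    while molecule is not None:
        result.append(molecule)
        molecule = read_molecule(reader)
    return result

# Invented records in the simplified layout described above
sample = io.StringIO(
    'COMPND FIRST\n'
    'ATOM 1 N 0.0 0.0 0.0\n'
    'ATOM 2 H 1.0 0.0 0.0\n'
    'END\n'
    'COMPND SECOND\n'
    'ATOM 1 O 0.0 0.0 0.0\n'
    'END\n'
)
molecules = read_all_molecules(sample)
print(molecules)
```

* Note that the coordinates are kept as strings; converting them with float would be a separate step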