# Chapter 5: Files

In this chapter, we will learn how to work with files on disk, and introduce some important concepts along the way: the use of external libraries, character encodings and file paths.

---

## Final exercises

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.

* Ex. 1: go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by increasing frequency and write them to a text file called `frequencies.txt`. Make sure your program ignores capitalization. Find out how you can sort a dictionary by value: there are several ways of doing this, so search the web for help. As a bonus exercise, add code so that the frequency dictionary ignores punctuation (hint: `string.punctuation` contains all ASCII punctuation characters).

In [3]:
import codecs
import string
from operator import itemgetter
from collections import OrderedDict

# Read the whole novel; 'utf-8-sig' decodes UTF-8 and drops a leading
# byte order mark, which would otherwise stick to the first word
f = codecs.open('data2/marathon.txt', 'r', 'utf-8-sig')
m = f.read()
f.close()

# Ignore capitalization
m = m.lower()

# Bonus: strip all punctuation characters
exclude = set(string.punctuation)
m = ''.join(ch for ch in m if ch not in exclude)

# Count how often each word occurs
freq = {}
words = m.split()
for word in words:
    if word not in freq:
        freq[word] = 1
    else:
        freq[word] += 1

# Sort the (word, count) pairs by count, lowest first
sorted_freq = OrderedDict(sorted(freq.items(), key=itemgetter(1)))

# Build the output, one "word: count" entry per line
t_out = ''
for word in sorted_freq:
    t_out += str(word) + ': ' + str(sorted_freq[word]) + '\r\n'

g = codecs.open('data2/frequencies.txt', 'w', 'utf-8')
g.write(t_out)
g.close()
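
As a point of comparison, the counting and sorting steps above could also be sketched with `collections.Counter` from the standard library. The function name `word_frequencies` and the sample sentence are made up for illustration, and the function works on a string rather than a file so that it runs without any data on disk. This is one possible approach, not the only way to sort by value.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs sorted by increasing frequency."""
    # Lowercase the text and delete every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    words = text.lower().translate(table).split()
    counts = Counter(words)
    # sorted() is stable, so words with equal counts keep first-seen order
    return sorted(counts.items(), key=lambda pair: pair[1])


# Hypothetical sample text, standing in for the contents of a novel
sample = "The cat saw the dog. The dog ran!"
for word, count in word_frequencies(sample):
    print(word + ': ' + str(count))
```

`Counter` replaces the explicit `if word not in freq` loop, and `str.translate` removes punctuation in a single pass.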

* Ex. 2: rewrite the novel from the previous exercise, replacing the name of the principal character with your own name. (Use the string method `replace()` for this.) Write the new version of the novel to a file called `starring_me.txt`.

In [15]:
import codecs
f = codecs.open('data2/marathon.txt', 'r', 'utf-8')
m = f.read()
f.close()
#print(m)

m_me = m.replace('Simon', 'Jonas')
#print(m_me)

g = codecs.open('data2/starring_me.txt', 'w', 'utf-8')
g.write(m_me)
g.close()

* Ex. 3: Write a program that takes a text file (e.g. `filename.txt`) and creates a new text file (e.g. `filename_numbered.txt`) in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file), i.e. prepend the number and a space to each line.

In [4]:
import codecs

# 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark, if any
f = codecs.open('data2/marathon.txt', 'r', 'utf-8-sig')
m = f.read()
f.close()

# This file uses Windows-style line endings
lines = m.split("\r\n")

# Prepend a running line number to every line
c = 0
t_out = ''
for line in lines:
    c += 1
    t_out += str(c) + "  " + str(line) + "\r\n"
print(t_out)

g = codecs.open('data2/marathon_numbered.txt', 'w', 'utf-8')
g.write(t_out)
g.close()

1  The Project Gutenberg EBook of Marathon, by G. H. Betz
2  
3  This eBook is for the use of anyone anywhere at no cost and with
4  almost no restrictions whatsoever.  You may copy it, give it away or
5  re-use it under the terms of the Project Gutenberg License included
6  with this eBook or online at www.gutenberg.org
7  
8  
9  Title: Marathon
10  
11  Author: G. H. Betz
12  
13  Release Date: September 5, 2010 [EBook #33641]
14  
15  Language: Dutch
16  
17  
18  *** START OF THIS PROJECT GUTENBERG EBOOK MARATHON ***
19  
20  
21  
22  
23  Produced by Jeroen Hellingman and the Online Distributed
24  Proofreading Team at http://www.pgdp.net/ for Project
25  Gutenberg.
26  
27  
28  
29  
30  
31  
32  
33  
34                               Mr. G. H. Betz
35  
36                                  MARATHON
37  
38  
39  
40                                 Amsterdam
41  
42                      Uitgevers-Maatschappy "Elsevier"
43  
44       
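
The numbering step can also be written with `enumerate()` and `str.splitlines()`, which recognizes both `\n` and `\r\n` line endings. The sketch below is only an illustration: the function name `number_lines` is made up, it works on a string rather than a file, and it uses the single space asked for in the exercise text rather than the two spaces in the solution above.

```python
def number_lines(text):
    """Prefix each line of text with its 1-based number and a space."""
    numbered = []
    # enumerate(..., start=1) yields (1, first line), (2, second line), ...
    for number, line in enumerate(text.splitlines(), start=1):
        numbered.append(str(number) + ' ' + line)
    return '\n'.join(numbered)


# Hypothetical sample with mixed line endings
print(number_lines("first\r\nsecond\nthird"))
```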

* Advanced bonus exercise, in case you feel like trying something crazy: a *sentence splitter* is a program capable of splitting a text into sentences. The standard set of heuristics for sentence splitting includes (but isn't limited to) the following rules: sentence boundaries occur at one of "." (periods), "?" or "!", except that:

> - Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
> - Periods followed by a digit with no intervening whitespace are not sentence boundaries.
> - Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles, are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
> - Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
> - Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.

You might want to check out string methods such as `.islower()` and `.isalpha()` in the official Python documentation online. Your task is to write a function that, given the name of a text file, writes its content to a new file with each sentence on a separate line; the name of the new file is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: `"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."` The result written to the new file should be:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.
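
One way to turn the rules above into code is sketched here. It is a minimal illustration rather than a complete sentence splitter: the names `TITLES`, `is_boundary`, `split_sentences` and `split_file` are made up, the title list is a small assumed sample, and any period, "?" or "!" that is not followed by whitespace is treated as not ending a sentence. It uses the built-in `open()` with an `encoding` argument instead of `codecs.open()`.

```python
# Assumed sample of titles; extend as needed
TITLES = {'Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.'}


def is_boundary(text, i):
    """Decide whether the '.', '?' or '!' at index i ends a sentence."""
    rest = text[i + 1:]
    if rest == '':
        return True                  # last character of the text
    if not rest[0].isspace():
        return False                 # 1.5, .9, cheapsite.com, i.e., "..."
    if text[i] in '?!':
        return True
    # A period followed by whitespace: inspect what comes next
    following = rest.lstrip()
    if following == '':
        return True                  # only whitespace is left
    if following[0].islower():
        return False                 # "i.e. he", "Jr. thinks"
    last_word = text[:i + 1].split()[-1]
    if following[0].isupper() and last_word in TITLES:
        return False                 # "Mr. Smith"
    return True


def split_sentences(text):
    """Return the sentences of text as a list of strings."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in '.?!' and is_boundary(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)       # text without closing punctuation
    return sentences


def split_file(in_name, out_name):
    """Write each sentence of in_name on its own line in out_name."""
    with open(in_name, encoding='utf-8') as infile:
        text = infile.read()
    # Treat line breaks inside a paragraph as ordinary spaces
    sentences = split_sentences(' '.join(text.split()))
    with open(out_name, 'w', encoding='utf-8') as outfile:
        outfile.write('\n'.join(sentences) + '\n')
    return sentences


test_text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
             "i.e. he paid a lot for it. Did he mind? Adam Jones Jr. "
             "thinks he didn't. In any case, this isn't true... Well, "
             "with a probability of .9 it isn't.")
for sentence in split_sentences(test_text):
    print(sentence)
```

Slicing the text at every candidate position is slow on long texts, but it keeps each rule easy to read and to check against the list above.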


------------------------------

You've reached the end of Chapter 5! You can safely ignore the code below; it's only there to make the page pretty:

In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Chapter 6: Functions

## Final exercises

**When you do the exercises below, don't write your code in the IPython notebook anymore; write each solution in a separate file and run it from the command line!**

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.

- `dice.py`: write a script that rolls a die every time you run it, by generating and printing a random integer between 1 and 6! You can do this with the `randint()` function from the `random` module (`random.randint(1, 6)` includes both endpoints).

- `arithmetic.py`: define a function `add()` and a function `multiply()` that sum and multiply (respectively) all the numbers in a list of numbers. For example, `add([1, 2, 3, 4])` should return 10, and `multiply([1, 2, 3, 4])` should return 24.

- `anagram.py`: two words are anagrams if you can rearrange the letters from one to spell the other. Write a function called `is_anagram` that takes two strings and returns `True` if they are anagrams and `False` otherwise.

- `hapax1.py`: a *hapax legomenon* (often abbreviated to hapax) is a word which occurs only once in either the written record of a language, the works of an author, or in a single text. Define a function `legomena` that, given the filename of a text, returns a list of all its hapax legomena. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Try out the function on your Gutenberg book from the previous chapter. For simplicity, make sure your Gutenberg file is in the same directory as your hapax script, so that you can just use the file's name as a relative path. Alternatively, you can use an absolute path to the file.

- `hapax2.py`: copy `hapax1.py` and try to move well-defined steps from your `legomena` function (reading and cleaning the input text, making a frequency dictionary) into separate functions, which are then called in the `legomena` function. This is called *code refactoring*: splitting multi-step functionality over several functions. This is good practice, and will make the next exercise much easier.

- `hapax3.py`: copy `hapax2.py` and create two additional functions: one that spots hapax `dislegomena` (words occurring exactly twice) and one that spots hapax `trislegomena` (words occurring exactly three times) in a text file.

- `calling_hapax.py`: in this standalone script, import the functions from `hapax3.py` and call all three functions from there. Again, try them out on your Gutenberg file.
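
To illustrate the refactoring idea behind the hapax exercises, here is a rough sketch in which each step lives in its own small function. Only the names `legomena`, `dislegomena` and `trislegomena` come from the exercises; `read_text`, `clean`, `frequencies` and `words_occurring` are made-up helper names. As a simplification, the three main functions take a text string rather than a filename, so the example runs without a file on disk; `read_text` shows how a file could be loaded first.

```python
import string


def read_text(filename):
    """Return the contents of a UTF-8 text file as one string."""
    with open(filename, encoding='utf-8') as infile:
        return infile.read()


def clean(text):
    """Lowercase the text and remove ASCII punctuation."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)


def frequencies(text):
    """Map each word in the cleaned text to its number of occurrences."""
    freq = {}
    for word in clean(text).split():
        freq[word] = freq.get(word, 0) + 1
    return freq


def words_occurring(text, n):
    """Return the words that occur exactly n times, in first-seen order."""
    return [word for word, count in frequencies(text).items() if count == n]


def legomena(text):
    return words_occurring(text, 1)


def dislegomena(text):
    return words_occurring(text, 2)


def trislegomena(text):
    return words_occurring(text, 3)


# Hypothetical sample text, standing in for read_text('some_book.txt')
sample = "A rose is a rose is a rose. So it goes."
print(legomena(sample))
print(dislegomena(sample))
print(trislegomena(sample))
```

Because the counting logic sits in `words_occurring`, the three public functions differ only in the number they pass along, which is the payoff of refactoring.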

------------------------------

You've reached the end of Chapter 6!