# Chapter 5: Files

In this chapter, we will learn how to work with files on disk, and introduce some important concepts along the way: the use of external libraries, character encodings and file paths.

---

## Final exercises

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some of the exercises below have been taken from http://www.ling.gu.se/~lager/python_exercises.html.

* Ex. 1: go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by increasing frequency and write it to a text file called `frequencies.txt`. Make sure your program ignores capitalization. Find out how you can sort a dictionary by value -- there are several ways of doing this, search the web in order to get some help. As a bonus exercise, add code so that the frequency dictionary ignores punctuation (hint: check out `string.punctuation` to get all punctuation).

In [38]:
import requests
import codecs
import string
# We will automatically download and place the file of Alice in Wonderland in the data dir.
# You will have done this manually for your book.
gutenberg_url = "http://www.gutenberg.org/files/11/11-0.txt" # url for .txt of Alice in Wonderland
r = requests.get(gutenberg_url)
text = r.text
filepath = "aliceinwonderland.txt"
f = codecs.open(filepath, "w", "utf-8")
f.write(text)
f.close()

# Actual solution
filepath = "aliceinwonderland.txt"
f = codecs.open(filepath, "r", "utf-8")
text = f.read()
f.close()

# tokenize and clean the punctuation
text = text.lower() # ignore capitalization
words = text.split()
words_no_punct = []
for word in words:
    # remove punctuation characters from the token, e.g. "rabbit-hole," becomes "rabbithole"
    letters_only = []
    for c in word:
        if c not in string.punctuation:
            letters_only.append(c)
    clean_word = "".join(letters_only)
    if clean_word: # skip tokens that consisted of punctuation only, such as "--"
        words_no_punct.append(clean_word)

# make the word frequency dictionary
word_freq = {}
for word in words_no_punct: # count the cleaned tokens, not the raw ones
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1

# write the words to a file, sorted by increasing frequency
word_freq_tuples = list(word_freq.items()) # get the dict as a list of (key, value) tuples with .items()
word_freq_tuples.sort(key=lambda x: x[1]) # sort allows you to specify a key to sort by
# the key is a lambda function that is called once for every element in the list; for each
# (word, count) tuple it returns the count, i.e. x[1]. Add reverse=True to sort by decreasing frequency.
wordfreq_fp = "frequencies.txt"
f = codecs.open(wordfreq_fp, "w", "utf-8")
for word, freq in word_freq_tuples:
    f.write(word + "\t" + str(freq) + "\n")
f.close()
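The exercise mentions that there are several ways to sort a dictionary by value. As one possible alternative to the loop-based solution above, the following minimal sketch uses `collections.Counter` for the counting, `str.translate` to strip punctuation, and the built-in `sorted()` for the ordering. The function names `count_words` and `sort_by_frequency` and the sample sentence are made up for this illustration. Note that `string.punctuation` only contains ASCII punctuation, so typographic quotes or dashes in a text would not be removed.

```python
import string
from collections import Counter

def count_words(text):
    """Return a Counter of the lowercased words in text, without ASCII punctuation."""
    # translation table that deletes every character in string.punctuation
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    return Counter(cleaned.split())

def sort_by_frequency(freqs, descending=False):
    """Return a list of (word, count) tuples sorted by count."""
    return sorted(freqs.items(), key=lambda pair: pair[1], reverse=descending)

# hypothetical sample text, only used to try out the functions
sample_freqs = count_words("The cat saw the dog. The dog ran!")
for word, freq in sort_by_frequency(sample_freqs):
    print(word + "\t" + str(freq))
```

Because a `Counter` is a dictionary, the resulting (word, count) tuples can be written to `frequencies.txt` in the same way as in the solution above.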

* Ex. 2: rewrite the novel from the previous exercise by replacing the name of the principal character with your own name. (Use the `replace()` function for this.) Write the new version of the novel to a file called `starring_me.txt`.

In [41]:
filepath = "aliceinwonderland.txt"
f_in = codecs.open(filepath, "r", "utf-8")
text = f_in.read()
f_in.close()

my_text = text.replace("Alice", "Kilroy")
my_text_fp = "starring_me.txt"
f_out = codecs.open(my_text_fp, "w", "utf-8")
f_out.write(my_text)
f_out.close()

* Ex. 3: Write a program that takes a text file (e.g. `filename.txt`) and creates a new text file (e.g. `filename_numbered.txt`) in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file), i.e. prepend the number and a space to each line.

In [43]:
filepath = "aliceinwonderland.txt"
f_in = codecs.open(filepath, "r", "utf-8")
text = f_in.readlines()
f_in.close()

numbered_fp = "aliceinwonderlandnumbered.txt"
f_out = codecs.open(numbered_fp, "w", "utf-8")
for i, line in enumerate(text, start=1): # start counting at 1 instead of 0
    f_out.write(str(i) + " " + line) # prepend the number and a space
f_out.close()
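The exercise asks for a program that works for any text file and names the output after the input (e.g. `filename.txt` becomes `filename_numbered.txt`). A minimal sketch of how this could be wrapped in functions is given below. It assumes Python 3, uses the built-in `open()` with an `encoding` argument instead of `codecs.open()`, and derives the output name with `os.path.splitext`. The names `numbered_path`, `number_lines` and `write_numbered_copy` are hypothetical helpers, not part of any library.

```python
import os

def numbered_path(path):
    """Turn e.g. 'filename.txt' into 'filename_numbered.txt'."""
    root, ext = os.path.splitext(path)
    return root + "_numbered" + ext

def number_lines(lines):
    """Prepend the line number (starting at 1) and a space to each line."""
    return [str(i) + " " + line for i, line in enumerate(lines, start=1)]

def write_numbered_copy(path, encoding="utf-8"):
    """Write a numbered copy of the file at path and return the new file's path."""
    # the with statement closes the file automatically, even if an error occurs
    with open(path, "r", encoding=encoding) as f_in:
        lines = f_in.readlines()
    out_path = numbered_path(path)
    with open(out_path, "w", encoding=encoding) as f_out:
        f_out.writelines(number_lines(lines))
    return out_path
```

Calling `write_numbered_copy("aliceinwonderland.txt")` would then create `aliceinwonderland_numbered.txt` next to the original file.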

* Advanced bonus exercise, in case you feel like trying something more challenging: a *sentence splitter* is a program capable of splitting a text into sentences. The standard set of heuristics for sentence splitting includes (but isn't limited to) the following rules: sentence boundaries occur at one of "." (periods), "?" or "!", except that:

> - Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
> - Periods followed by a digit with no intervening whitespace are not sentence boundaries.
> - Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles, are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
> - Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
> - Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.

You might want to check out string methods like `.islower()` and `.isalpha()` in the official Python documentation online. Your task is to write a function that, given the name of a text file, writes its content to a new file with each sentence on a separate line; the name of the new file is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: `"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."` The result written to the new file should be:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.
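One possible way to approach this bonus exercise is sketched below. It is only an illustration of the rules listed above, not a complete sentence splitter: the `TITLES` set is an assumed sample list, the function names are made up for this sketch, and cases such as closing quotation marks after a period are not handled. The sketch also applies the "no whitespace follows" rule to "?" and "!" for simplicity, and it uses the built-in `open()` with an `encoding` argument (Python 3).

```python
TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}  # assumed sample list of titles

def is_sentence_boundary(text, i):
    """Return True if the character at index i ends a sentence."""
    if text[i] not in ".?!":
        return False
    j = i + 1
    if j == len(text):
        return True  # last character of the text
    if not text[j].isspace():
        # directly followed by a digit, a letter or more punctuation
        return False
    while j < len(text) and text[j].isspace():
        j += 1  # skip the whitespace after the mark
    if j == len(text) or text[i] != ".":
        return True
    if text[j].islower():
        return False  # period, whitespace, lowercase letter
    # find the word directly in front of the period
    k = i
    while k > 0 and not text[k - 1].isspace():
        k -= 1
    if text[j].isupper() and text[k:i] in TITLES:
        return False  # e.g. "Mr. Smith"
    return True

def split_sentences(text):
    """Split text into a list of sentences."""
    sentences = []
    start = 0
    for i in range(len(text)):
        if is_sentence_boundary(text, i):
            # collapse line breaks and repeated spaces inside the sentence
            sentence = " ".join(text[start:i + 1].split())
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = " ".join(text[start:].split())
    if tail:
        sentences.append(tail)  # text after the last boundary, if any
    return sentences

def split_file(in_path, out_path):
    """Write the sentences of in_path to out_path, one per line, and return them."""
    with open(in_path, "r", encoding="utf-8") as f_in:
        sentences = split_sentences(f_in.read())
    with open(out_path, "w", encoding="utf-8") as f_out:
        for sentence in sentences:
            f_out.write(sentence + "\n")
    return sentences

sample = (
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid "
    "a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any "
    "case, this isn't true... Well, with a probability of .9 it isn't."
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Keeping the boundary test in its own function makes it easier to add further rules (for example, a longer list of titles) without touching the code that reads and writes the files.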


------------------------------

You've reached the end of Chapter 5! You can safely ignore the code below; it's only there to make the page pretty:

In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()