# Chapter 5: Files

In this chapter, we will learn how to work with files on disk, and introduce some important concepts along the way: the use of external libraries, character encodings and file paths.

---

## Final exercises

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some of the exercises below have been taken from http://www.ling.gu.se/~lager/python_exercises.html.

* Ex. 1: go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by increasing frequency and write them to a text file called `frequencies.txt`. Make sure your program ignores capitalization. Find out how you can sort a dictionary by value: there are several ways of doing this, so search the web for help. As a bonus exercise, add code so that the frequency dictionary ignores punctuation (hint: `string.punctuation` contains all ASCII punctuation characters).

In [1]:
#download text and make available as plain text
import os
import codecs
f = codecs.open('russian_short_stories.txt', 'r', 'utf-8')
t_in = f.read()
f.close()

#remove punctuation and uppercase letters
import string
translator = str.maketrans('', '', string.punctuation)
t_p = t_in.translate(translator)
t_l = t_p.lower()
#print(t_l)

#make a frequency dictionary
import collections
t_split = t_l.split()
freq = collections.Counter(t_split)
freq_dict = dict(freq)
#for word, count in freq_dict.items():
#    print(word, count)

#sort the (word, count) pairs by increasing count
import operator
sorted_f = sorted(freq_dict.items(), key=operator.itemgetter(1))
t_out = str(sorted_f)
print(t_out)

#write to new file
f = codecs.open('frequencies.txt', 'w', 'utf-8')
f.write(t_out)
f.close()

FileNotFoundError: [Errno 2] No such file or directory: 'russian_short_stories.txt'
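
The counting and sorting steps of Ex. 1 can also be tried without any file on disk. The cell below is a minimal sketch, not the reference solution: the function names `word_frequencies` and `sorted_by_frequency` and the sample sentence are made up for illustration, and ties between equally frequent words are broken alphabetically so that the output is predictable.

```python
import collections
import string


def word_frequencies(text):
    """Count the lowercase, punctuation-free words in a string."""
    # strip all ASCII punctuation, then lowercase and split on whitespace
    translator = str.maketrans('', '', string.punctuation)
    words = text.translate(translator).lower().split()
    return collections.Counter(words)


def sorted_by_frequency(freq):
    """Return (word, count) pairs ordered by increasing count."""
    # sort on the count first, then on the word itself to break ties
    return sorted(freq.items(), key=lambda pair: (pair[1], pair[0]))


# hypothetical sample text standing in for the downloaded book
sample = "The cat saw the dog. The dog ran!"
pairs = sorted_by_frequency(word_frequencies(sample))

# one "word<TAB>count" line per word is easier to read than str(list)
lines = [word + "\t" + str(count) for word, count in pairs]
print("\n".join(lines))
```

Joining the pairs into one line per word is a design choice: the resulting string can be written to `frequencies.txt` and opened in a spreadsheet or text editor without further processing.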

* Ex. 2: rewrite the novel from the previous exercise by replacing the name of its principal character with your own name. (Use the string method `replace()` for this.) Write the new version of the novel to a file called `starring_me.txt`.

In [None]:
starring_me = t_in.replace("Tomsky", "Veerle")

f = codecs.open('starring_me.txt', 'w', 'utf-8')
f.write(starring_me)
f.close()

* Ex. 3: Write a program that takes a text file (e.g. `filename.txt`) and creates a new text file (e.g. `filename_numbered.txt`) in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file), i.e. prepend the number and a space to each line.

In [None]:
import codecs
f = codecs.open('russian_short_stories.txt', 'r', 'utf-8')
text = f.read()
f.close()

t_out = []
#number every line, including empty ones, starting from 1
for line_number, line in enumerate(text.splitlines(), start=1):
    new_line = str(line_number) + " " + line
    t_out.append(new_line)

t = "\n".join(t_out)
print(t)

#store it in new file
f = codecs.open('numbered_lines.txt', 'w', 'utf-8')
f.write(t)
f.close()
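
Ex. 3 asks for a program that takes a file name, so the same idea can be wrapped in a function. The sketch below is one possible way to do this; the function name `number_lines` is an assumption, and the demo writes a small temporary file so that it runs without a downloaded book. This variant numbers every line, including empty ones.

```python
import os
import tempfile


def number_lines(in_path, out_path):
    """Copy in_path to out_path, prefixing each line with its number."""
    # the with-statement closes the file automatically, even on errors
    with open(in_path, 'r', encoding='utf-8') as f_in:
        lines = f_in.read().splitlines()
    # enumerate(..., start=1) yields (1, first line), (2, second line), ...
    numbered = [str(i) + " " + line for i, line in enumerate(lines, start=1)]
    with open(out_path, 'w', encoding='utf-8') as f_out:
        f_out.write("\n".join(numbered))
    return numbered


# demo on a temporary file with three lines, the second one empty
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'filename.txt')
out_path = os.path.join(tmp_dir, 'filename_numbered.txt')
with open(in_path, 'w', encoding='utf-8') as f:
    f.write("first line\n\nthird line\n")

result = number_lines(in_path, out_path)
print(result)
```

Building the paths with `os.path.join` keeps the example independent of the operating system's path separator.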

* Advanced bonus exercise, in case you feel like trying something crazy: a *sentence splitter* is a program capable of splitting a text into sentences. The standard set of heuristics for sentence splitting includes (but isn't limited to) the following rules. Sentence boundaries occur at one of "." (period), "?" or "!", except that:

> - Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
> - Periods followed by a digit with no intervening whitespace are not sentence boundaries.
> - Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles, are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
> - Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
> - Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.

You might want to check out string methods like `.islower()` and `.isalpha()` in the official Python documentation online. Your task here is to write a function that, given the name of a text file, writes its content to a new file with each sentence on a separate line. The name of the new file is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: `"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."` The result written to the new file should be:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.


In [None]:
#named `text` rather than `string`, so that the `string` module is not shadowed
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
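
One possible way to approach the sentence splitter is sketched below. It is not the reference solution: the names `TITLES`, `is_boundary`, `split_sentences` and `split_file` are made up, the title list beyond Mr., Mrs. and Dr. is an assumption, and the rules are simplified so that any ".", "?" or "!" directly followed by a non-whitespace character is never treated as a boundary. That single check covers digits, internal periods and trailing punctuation, but it also means a sentence ending in a closing quote would not be split.

```python
import os
import tempfile

# short list of titles; only Mr., Mrs. and Dr. come from the exercise text,
# the others are assumptions
TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}


def is_boundary(text, i, start):
    """Decide whether the '.', '?' or '!' at text[i] ends a sentence."""
    j = i + 1
    if j == len(text):
        return True                  # last character of the text
    if not text[j].isspace():
        return False                 # "1.5", "cheapsite.com", "i.e", "..."
    while j < len(text) and text[j].isspace():
        j += 1                       # skip to the next visible character
    if j == len(text):
        return True                  # only whitespace is left
    if text[j].islower():
        return False                 # "i.e. he", "Jr. thinks"
    if text[i] == ".":
        words = text[start:i].split()
        if words and words[-1] in TITLES:
            return False             # "Mr. Smith"
    return True


def split_sentences(text):
    """Return the sentences of a string as a list."""
    sentences = []
    start = 0                        # index where the current sentence begins
    for i, char in enumerate(text):
        if char in ".?!" and is_boundary(text, i, start):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())   # text without a final mark
    return sentences


def split_file(in_name, out_name):
    """Write each sentence of in_name on its own line in out_name."""
    with open(in_name, 'r', encoding='utf-8') as f_in:
        sentences = split_sentences(f_in.read())
    with open(out_name, 'w', encoding='utf-8') as f_out:
        f_out.write("\n".join(sentences))
    return sentences


# demo with the test text from the exercise, stored in a temporary file
sample = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he "
          "paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. "
          "In any case, this isn't true... Well, with a probability of .9 "
          "it isn't.")
tmp_dir = tempfile.mkdtemp()
in_name = os.path.join(tmp_dir, 'sample.txt')
out_name = os.path.join(tmp_dir, 'sample_sentences.txt')
with open(in_name, 'w', encoding='utf-8') as f:
    f.write(sample)

for sentence in split_file(in_name, out_name):
    print(sentence)
```

Keeping the boundary decision in its own function makes it easy to add further rules, such as a longer title list, without touching the loop that collects the sentences.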



------------------------------

You've reached the end of Chapter 5! You can safely ignore the code below; it's only there to make the page pretty:

In [None]:
from IPython.display import HTML
def css_styling():
    #read the stylesheet and make sure the file is closed afterwards
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()