# Chapter 5: Files

In this chapter, we will learn how to work with files on disk, and introduce some important concepts along the way: the use of external libraries, character encodings and file paths.

---

## Final exercises

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some of the exercises below have been taken from http://www.ling.gu.se/~lager/python_exercises.html.

* Ex. 1: Go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by increasing frequency and write the result to a text file called `frequencies.txt`. Make sure your program ignores capitalization. There are several ways to sort a dictionary by value; search the web for help. As a bonus exercise, add code so that the frequency dictionary ignores punctuation (hint: `string.punctuation` contains all ASCII punctuation characters).

In [7]:

import codecs
import json
import re
import string
from collections import OrderedDict

# Read the whole novel into one string (the file is assumed to be UTF-8 encoded).
with codecs.open("ThePictureOfDorianGray.txt", "r", "utf-8") as f:
    t_input = f.read()

# Lowercase the text so that capitalization is ignored, then split it into
# word tokens and single punctuation tokens.
list_tokens_original = re.findall(r"[\w']+|[.,!?;]", t_input.lower())

# Frequency dictionary: the words in lower case, without punctuation marks.
dict_tokens = {}
for token in list_tokens_original:
    # three possibilities:
    # a) new token --> add key + value 1 to dict
    # b) not first occurrence --> value += 1
    # c) token = punctuation mark --> ignore
    if token not in string.punctuation:
        if token not in dict_tokens:
            dict_tokens[token] = 1
        else:
            dict_tokens[token] += 1

# Sort the words in the dictionary by increasing frequency and write the result to frequencies.txt.
sorted_frequency_dictionary = OrderedDict(sorted(dict_tokens.items(), key=lambda x: x[1]))

# The sorted dictionary is stored in JSON format.
with codecs.open("frequencies.txt", "w", "utf-8") as f:
    json.dump(sorted_frequency_dictionary, f)
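
As one of the "several ways" of sorting a dictionary by value, the sketch below uses `collections.Counter` from the standard library together with `sorted()`. The function names `word_frequencies` and `sorted_by_increasing_frequency` and the sample sentence are made up for illustration; punctuation is removed here by stripping it from both ends of each whitespace-separated word, which is a simpler assumption than the regular expression used in the solution above.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count the words in text, ignoring capitalization and surrounding punctuation."""
    counts = Counter()
    for raw_word in text.lower().split():
        # Remove punctuation from both ends, e.g. "dog." -> "dog".
        word = raw_word.strip(string.punctuation)
        if word:  # skip tokens that consisted of punctuation only
            counts[word] += 1
    return counts


def sorted_by_increasing_frequency(counts):
    """Return a list of (word, count) pairs, least frequent first."""
    # sorted() is stable: words with equal counts keep their first-seen order.
    return sorted(counts.items(), key=lambda pair: pair[1])


# Hypothetical sample text, used only to demonstrate the two functions.
sample = "The cat saw the dog. The dog ran!"
pairs = sorted_by_increasing_frequency(word_frequencies(sample))
for word, count in pairs:
    print(word, count)
```

`Counter` also offers `most_common()`, which returns the pairs in decreasing order of frequency; the exercise asks for increasing order, hence the explicit `sorted()` call.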

* Ex. 2: Rewrite the novel from the previous exercise by replacing the name of the principal character with your own name. (Use the string method `replace()` for this.) Write the new version of the novel to a file called `starring_me.txt`.

In [8]:
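import codecs

# Read the novel (the file is assumed to be UTF-8 encoded).
with codecs.open("ThePictureOfDorianGray.txt", "r", "utf-8") as f:
    t_input = f.read()

# Replace every occurrence of the main character's first name and save the result.
with codecs.open("starring_me.txt", "w", "utf-8") as f2:
    f2.write(t_input.replace("Dorian", "Laura"))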
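
A small illustration of how `str.replace()` behaves may help when checking the result. The sample sentence below is invented for this purpose; the names "Dorian" and "Laura" are taken from the solution above. The sketch shows that `replace()` returns a new string, replaces every occurrence, and is case-sensitive.

```python
# Hypothetical sample text, not a quotation from the novel.
text = "Dorian Gray looked at the portrait. DORIAN sighed. Dorian's face was pale."

# replace() returns a new string; the original string is left unchanged.
new_text = text.replace("Dorian", "Laura")

print(new_text)
print(text.count("Dorian"))      # number of case-sensitive matches in the original
print(new_text.count("Dorian"))  # no matches are left after replacing
```

Because the match is case-sensitive, the upper-case "DORIAN" is left as it is. Whether that matters depends on how the name is spelled in the downloaded text.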



* Ex. 3: Write a program that takes a text file (e.g. `filename.txt`) and creates a new text file (e.g. `filename_numbered.txt`) in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file), i.e. prepend the number and a space to each line.

In [10]:

import os

def numbered_lines(filename_in):
    """Write a copy of filename_in, named "<name>_numbered<extension>", with every line preceded by its line number."""
    # Build the output name, e.g. "filename.txt" -> "filename_numbered.txt".
    base = os.path.basename(filename_in)
    root, extension = os.path.splitext(base)
    # The output file is created in the current working directory.
    # os.path.abspath() is the public, cross-platform way to get the full path.
    filename_out = os.path.abspath(root + "_numbered" + extension)

    # The file is assumed to be UTF-8 encoded, as in the previous exercises.
    with open(filename_in, 'r', encoding='utf-8') as file_in:
        l_lines = file_in.readlines()

    with open(filename_out, 'w', encoding='utf-8') as file_out:
        # Number the lines from 1 to n and prepend the number and a single space.
        for (number, line) in enumerate(l_lines, start=1):
            file_out.write('%d %s' % (number, line))

filename_in="ThePictureOfDorianGray.txt"

numbered_lines(filename_in)
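
The file-name handling in this exercise can also be sketched with the standard `pathlib` module. The helper names `numbered_name` and `number_lines` are hypothetical, and, unlike the solution above, this variant places the output file next to the input file rather than in the current working directory.

```python
from pathlib import Path


def numbered_name(filename_in):
    """Return the output path, e.g. 'filename.txt' -> 'filename_numbered.txt'."""
    path = Path(filename_in)
    # stem is the file name without its extension, suffix is the extension itself.
    return path.with_name(path.stem + "_numbered" + path.suffix)


def number_lines(lines):
    """Yield every line prefixed with its 1-based number and a single space."""
    for number, line in enumerate(lines, start=1):
        yield f"{number} {line}"


# Small demonstration on in-memory data, so that no file is needed.
print(numbered_name("filename.txt"))
print(list(number_lines(["first line\n", "second line\n"])))
```

To process a real file, the lines read from `filename_in` would be passed through `number_lines` and written to the path returned by `numbered_name`.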







* Advanced bonus exercise, in case you feel like trying something more ambitious: a *sentence splitter* is a program that splits a text into sentences. The standard set of heuristics for sentence splitting includes (but isn't limited to) the following rules. Sentence boundaries occur at one of "." (period), "?" or "!", except that:

> - Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
> - Periods followed by a digit with no intervening whitespace are not sentence boundaries.
> - Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles, are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
> - Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
> - Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.

You might want to check out string methods such as `.islower()` and `.isalpha()` in the official Python documentation online. Your task is to write a function that, given the name of a text file, writes its content to a new file with each sentence on a separate line. The name of the new file is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: `"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."` The result written to the new file should be:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.
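
The heuristics listed above can be sketched roughly as follows. This is one possible approach under stated assumptions, not a reference solution: the `TITLES` set and the function names `is_boundary`, `split_sentences` and `split_file` are invented for illustration, the input file is assumed to be UTF-8 encoded, and "?" and "!" are treated as boundaries only when followed by whitespace or the end of the text.

```python
# Assumed sample list of titles (written without the final period).
TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}


def is_boundary(text, i):
    """Decide whether the '.', '?' or '!' at position i ends a sentence."""
    following = text[i + 1:]
    if not following:
        return True   # the mark is the last character of the text
    if not following[0].isspace():
        return False  # a digit, letter or more punctuation follows directly
    if text[i] != ".":
        return True   # "?" or "!" followed by whitespace
    next_chars = following.lstrip()
    if not next_chars:
        return True   # only whitespace is left
    if next_chars[0].islower():
        return False  # period + whitespace + lowercase letter
    words_before = text[:i].split()
    if words_before and words_before[-1] in TITLES:
        return False  # a title such as "Mr." before an uppercase letter
    return True


def split_sentences(text):
    """Return the sentences of text as a list of strings."""
    sentences = []
    start = 0
    for i, char in enumerate(text):
        if char in ".?!" and is_boundary(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:  # text after the last boundary, if any
        sentences.append(tail)
    return sentences


def split_file(filename_in, filename_out):
    """Write one sentence per line to filename_out and return the sentences."""
    with open(filename_in, "r", encoding="utf-8") as file_in:
        # Collapse line breaks and repeated spaces into single spaces first.
        text = " ".join(file_in.read().split())
    sentences = split_sentences(text)
    with open(filename_out, "w", encoding="utf-8") as file_out:
        for sentence in sentences:
            file_out.write(sentence + "\n")
    return sentences


test_text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid "
             "a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any "
             "case, this isn't true... Well, with a probability of .9 it isn't.")
for sentence in split_sentences(test_text):
    print(sentence)
```

The sketch handles the test text from the exercise, but it knows nothing about quotation marks or about abbreviations outside `TITLES`, so real texts will need additional rules.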


------------------------------

You've reached the end of Chapter 5! You can safely ignore the code below; it's only there to make the page pretty:

In [None]:
from IPython.display import HTML
def css_styling():
    # Read the notebook's custom stylesheet and close the file again afterwards.
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()