# Text Processing

*Based on the Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*

---

In this session we will focus on one of the most important tasks in Humanities research: text processing. One of the goals of text processing is to clean up your data as a pre-step to some kind of data analysis. Another common goal is to convert a given text collection to a different format. In this session we will provide you with the necessary tools to work with collections of texts, clean them and perform some rudimentary data analyses on them.

In this notebook, you will look at opening a text file, reading the text, and counting some word occurrences. The examples shown will introduce a few new Python concepts. These include reading from a file, loops, functions and dictionaries.

As before, read the explanations in the cells, and execute the cells containing code to see how it works. There are also 5 exercises in this notebook which you have to do yourself.

## Reading files

The material for this course includes a small text file called "austen-emma-excerpt.txt". Save it in a sub-directory called "data", inside the directory where you saved this notebook. Then use the function `open` to open "austen-emma-excerpt.txt":


In [33]:
infile = open('data/austen-emma-excerpt.txt') 

We now print `infile`. What do you think will happen?

In [34]:
print(infile)

<_io.TextIOWrapper name='data/austen-emma-excerpt.txt' mode='r' encoding='cp1252'>


"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `data/austen-emma-excerpt.txt`. In order to *read* the contents of the file we must add the function `read` as follows:

In [35]:
print(infile.read())

Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.


`read` is a function that operates on `TextIOWrapper` objects and allows us to read the contents of a file into Python. Let's assign the contents of the file to the variable `text`:

In [36]:
infile = open('data/austen-emma-excerpt.txt')
text = infile.read()

The variable `text` now holds the contents of the file `data/austen-emma-excerpt.txt` and we can access and manipulate it just like any other string. After we read the contents of a file, the `TextIOWrapper` no longer needs to be open. In fact, it is good practice to close it as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [37]:
infile.close()

---

## Writing our first function

Remember that in a previous lab, we counted the number of hashtags in Tweets? Counting objects in a text is a very common thing to do, and Python contains the function `count` especially for this purpose. The function operates on strings (`somestring.count(argument)`) and takes as argument the object you want to count. Using this function, we can easily count the number of "e"s in our text:

In [38]:
number_of_es = text.count("e")
print(number_of_es)

78


In fact, `count` takes as argument any string you would like to find. We could just as well count how often the determiner `an` occurs:

In [39]:
print(text.count("an"))

12


The string `an` is found 12 times in our text. Does that mean that the word *an* occurs 12 times in our text? Go ahead. Count it yourself. In fact, *an* occurs only twice... Think about this. Why does Python print 12?

If we want to count how often the word *an* occurs in the text and not the string `an`, we could surround *an* with spaces, like the following:

In [40]:
print(text.count(" an "))

2


Although it gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of *an* followed by a semicolon or some end-of-sentence marker? Then we would need to query the text multiple times for each possible context of *an*. For that reason, we're going to approach the problem using a different, more sophisticated strategy. 
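
To make this limitation concrete, here is a small sketch on a made-up sentence (the variable `sentence` is ours, not part of the course material). It contains the word *an* three times, yet neither of the two queries we tried so far finds exactly three.

```python
# A made-up sentence for illustration; it contains the word "an" three times
sentence = "She took an apple, an orange; then an."

# Counting the bare string also matches the "an" inside "orange"
substring_hits = sentence.count("an")

# Padding with spaces misses the final "an", which is followed by a dot
padded_hits = sentence.count(" an ")

print(substring_hits)  # 4
print(padded_hits)     # 2
```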

Recall from the previous chapter the function `split`. What does this function do? It operates on a string, splits it on whitespace (spaces, tabs and newlines) and returns a list of smaller strings (or words):

In [41]:
print(text.split())

['Emma', 'by', 'Jane', 'Austen', '1816', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence;', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her.', 'She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate,', 'indulgent', 'father;', 'and', 'had,', 'in', 'consequence', 'of', 'her', "sister's", 'marriage,', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period.', 'Her', 'mother', 'had', 'died', 'too', 'long', 'ago', 'for', 'her', 'to', 'have', 'more', 'than', 'an', 'indistinct', 'remembrance', 'of', 'her', 'caresses;', 'and', 'her', 'place', 'had', 'been', 'supplied', 'by', 'an', 'excellent', 'woman', 'as', 'governess,', 'who', 'had', 'fallen'
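
The following is a minimal sketch of how `split` treats different kinds of whitespace, using a made-up string `line`. Called without an argument, `split` treats any run of spaces, tabs or newlines as one separator. Called with `" "`, it splits on every individual space instead, which is rarely what you want for text.

```python
# A made-up string with a double space and a newline in it
line = "from a very early period.  Her mother\nhad died"

# Without an argument, any run of whitespace counts as one separator
print(line.split())
# ['from', 'a', 'very', 'early', 'period.', 'Her', 'mother', 'had', 'died']

# With an explicit " ", every single space is a separator, so the double
# space produces an empty string and the newline is left untouched
print(line.split(" "))
# ['from', 'a', 'very', 'early', 'period.', '', 'Her', 'mother\nhad', 'died']
```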

---

#### Counting words

The next cell shows you an example of how to count how often certain words occur in a list of words.

In [42]:
words = text.split()
number_of_hits = 0
item_to_count = "in"
for word in words:
    if word == item_to_count:
        number_of_hits += 1

---

We will go through the code step by step. We would like to know how often the preposition *in* occurs in our text. As a first step we will split the string `text` into a list of words:

In [43]:
words = text.split()

Next we define a variable `number_of_hits` and set it to zero.

In [44]:
number_of_hits = 0

The final step is to loop over all words in `words` and add 1 to `number_of_hits` if we find a word that is equal to `in`:

In [45]:
item_to_count = "in"
for word in words:
    if word == item_to_count:
        number_of_hits += 1
print(number_of_hits)

3


Now, say we would like to know how often the word *of* occurs in our text. We could adapt the previous lines of code to search for the word *of*, but what if we also would like to count the number of times *the* occurs, and *house* and *had* and... It would be really cumbersome to repeat all these lines of code for each particular search term we have. Programming is supposed to reduce our workload, not increase it. Just like the function `count` for strings, we would like to have a function that operates on lists, takes as argument the object we would like to count and returns the number of times this object occurs in our list.

In this and the previous chapter you have already seen lots of functions. A function does something, often based on some argument you pass to it, and generally returns a result. You are not just limited to using functions in the standard library but you can write your own functions.

In fact, you *must* write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. Functions are defined using the `def` keyword, they take a name and optionally a number of parameters. 

    def some_name(optional_parameters):

The `return` statement returns a value back to the caller and always ends the execution of the function. 
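
As a small illustration of `def` and `return`, consider the sketch below. The function `first_long_word` and its parameters are invented for this example. It shows that `return` hands a value back to the caller, and that no further code in the function runs once a `return` has been reached.

```python
def first_long_word(words, minimum_length):
    # Look at the words one by one
    for word in words:
        if len(word) >= minimum_length:
            # return ends the function immediately, so the loop stops here
            return word
    # Only reached when no word was long enough
    return None

example_words = ["a", "most", "affectionate", "indulgent", "father"]
print(first_long_word(example_words, 9))   # affectionate
print(first_long_word(example_words, 20))  # None
```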

Going back to our problem, we want to write a function called `count_in_list`. It takes two arguments: (1) the object we would like to count and (2) the list in which we want to count that object. Let's write down the function definition in Python:

    def count_in_list(item_to_count, list_to_search):
    
Do you understand all the syntax and keywords in the definition above? Now all we need to do is to add the lines of code we wrote before to the body of this function:

In [46]:
def count_in_list(item_to_count, list_to_search): 
    number_of_hits = 0                            
    for item in list_to_search:                   
        if item == item_to_count:                 
            number_of_hits += 1                   
    return number_of_hits                         

All code should be familiar to you, except perhaps the `return` keyword. It tells Python to hand back the value of `number_of_hits` as the result of calling the function. OK, let's go through our function one more time, just to make sure you really understand all of it.

1. First we define a function using `def` and give it the name `count_in_list` (line 1);
2. This function takes two arguments: `item_to_count` and `list_to_search` (line 1);
3. Within the function, we define a variable `number_of_hits` and assign to it the value zero, since at that stage we haven't found anything yet (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If we find a word that is equal to `item_to_count` (line 4), we add 1 to `number_of_hits` (line 5);
6. Return the result of `number_of_hits` (line 6).

Let's test our little function! We will first count how often the word *an* occurs in our list of words `words`.

In [47]:
print(count_in_list("an", words))

2
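
As an aside, Python lists have a `count` method of their own, which does the same job as our `count_in_list`. The sketch below repeats the function so that it runs on its own, and uses a short made-up word list (`example_words`) to compare the two.

```python
def count_in_list(item_to_count, list_to_search):
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

# A short made-up word list for illustration
example_words = "more than an indistinct remembrance by an excellent woman".split()

print(count_in_list("an", example_words))  # 2
print(example_words.count("an"))           # 2
```

Writing the function ourselves is still worthwhile: it shows what happens under the hood, and the same pattern returns in the counting functions further on.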


---

#### Exercise 1

Using the function we defined, print how often the word *the* occurs in our text.

In [48]:
# insert your code here
print(count_in_list("the",words))

4


---

## A more general count function

Our function `count_in_list` is a concise and convenient piece of code that lets us count, quickly and without too much repetition, how often certain items occur in a given list. Now what if we would like to find out for all words in our text how often they occur? It would still be quite cumbersome to call our function for each unique word. We would like to have a function that takes as argument a particular list and counts for each unique item in that list how often it occurs. There are multiple ways of writing such a function. We will show you two ways of doing it.

### A count function (take 1)

In the previous chapter you acquainted yourself with the `dictionary` structure. Recall that a dictionary consists of keys and values and allows you to quickly look up a value. We will use a dictionary to write the function `counter` that takes as argument a list and returns a `dictionary` with `keys` for each unique item and `values` showing the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.

We start with defining a variable `counts` which is an empty dictionary:

In [49]:
counts = {}

Next we will loop over all words in our list `words`. For each word, we check whether the dictionary already contains it. If so, we add 1 to its value. If not, we add the word to the dictionary and assign to it the value 1.

In [50]:
for word in words:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
print(counts)

{'Emma': 2, 'by': 2, 'Jane': 1, 'Austen': 1, '1816': 1, 'VOLUME': 1, 'I': 2, 'CHAPTER': 1, 'Woodhouse,': 1, 'handsome,': 1, 'clever,': 1, 'and': 5, 'rich,': 1, 'with': 2, 'a': 4, 'comfortable': 1, 'home': 1, 'happy': 1, 'disposition,': 1, 'seemed': 1, 'to': 3, 'unite': 1, 'some': 1, 'of': 8, 'the': 4, 'best': 1, 'blessings': 1, 'existence;': 1, 'had': 4, 'lived': 1, 'nearly': 1, 'twenty-one': 1, 'years': 1, 'in': 3, 'world': 1, 'very': 2, 'little': 2, 'distress': 1, 'or': 1, 'vex': 1, 'her.': 1, 'She': 1, 'was': 1, 'youngest': 1, 'two': 1, 'daughters': 1, 'most': 1, 'affectionate,': 1, 'indulgent': 1, 'father;': 1, 'had,': 1, 'consequence': 1, 'her': 4, "sister's": 1, 'marriage,': 1, 'been': 2, 'mistress': 1, 'his': 1, 'house': 1, 'from': 1, 'early': 1, 'period.': 1, 'Her': 1, 'mother': 2, 'died': 1, 'too': 1, 'long': 1, 'ago': 1, 'for': 1, 'have': 1, 'more': 1, 'than': 1, 'an': 2, 'indistinct': 1, 'remembrance': 1, 'caresses;': 1, 'place': 1, 'supplied': 1, 'excellent': 1, 'woman': 1,

If you don't remember anymore how dictionaries work, go back to the previous chapter and read the part about dictionaries once more.

Now that our code is working, we can add it to a function. We define the function `counter` using the `def` keyword. It takes one argument (`list_to_search`).

In [51]:
def counter(list_to_search):                 
    counts = {}                              
    for word in list_to_search:              
        if word in counts:                   
            counts[word] = counts[word] + 1  
        else:                                
            counts[word] = 1                 
    return counts                            

Hopefully we are boring you, but let's go through this function step by step.

1. We define a function using `def` and give it the name `counter` (line 1);
2. This function takes a single argument `list_to_search` which is the list we want to search through (line 1);
3. Next we define a variable `counts` which is an empty dictionary (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If the word is already in `counts`, we look up its current value and add 1 to it (lines 4-5);
6. If the word is not in `counts` (else clause), we add the word to the dictionary and assign it the value 1 (lines 6-7);
7. Return the result of `counts` (line 8).

Let's try out our new function!

In [52]:
print(counter(words))

{'Emma': 2, 'by': 2, 'Jane': 1, 'Austen': 1, '1816': 1, 'VOLUME': 1, 'I': 2, 'CHAPTER': 1, 'Woodhouse,': 1, 'handsome,': 1, 'clever,': 1, 'and': 5, 'rich,': 1, 'with': 2, 'a': 4, 'comfortable': 1, 'home': 1, 'happy': 1, 'disposition,': 1, 'seemed': 1, 'to': 3, 'unite': 1, 'some': 1, 'of': 8, 'the': 4, 'best': 1, 'blessings': 1, 'existence;': 1, 'had': 4, 'lived': 1, 'nearly': 1, 'twenty-one': 1, 'years': 1, 'in': 3, 'world': 1, 'very': 2, 'little': 2, 'distress': 1, 'or': 1, 'vex': 1, 'her.': 1, 'She': 1, 'was': 1, 'youngest': 1, 'two': 1, 'daughters': 1, 'most': 1, 'affectionate,': 1, 'indulgent': 1, 'father;': 1, 'had,': 1, 'consequence': 1, 'her': 4, "sister's": 1, 'marriage,': 1, 'been': 2, 'mistress': 1, 'his': 1, 'house': 1, 'from': 1, 'early': 1, 'period.': 1, 'Her': 1, 'mother': 2, 'died': 1, 'too': 1, 'long': 1, 'ago': 1, 'for': 1, 'have': 1, 'more': 1, 'than': 1, 'an': 2, 'indistinct': 1, 'remembrance': 1, 'caresses;': 1, 'place': 1, 'supplied': 1, 'excellent': 1, 'woman': 1,
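
Counting items is so common that Python's standard library ships a ready-made tool for it: `Counter` from the `collections` module. The sketch below uses a made-up word list (`example_words`) and repeats our `counter` function so that it runs on its own. It suggests that both approaches arrive at the same counts.

```python
from collections import Counter

def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts

# A short made-up word list for illustration
example_words = ["of", "a", "mother", "of", "her", "of", "a"]

own_counts = counter(example_words)
library_counts = Counter(example_words)

print(own_counts)                          # {'of': 3, 'a': 2, 'mother': 1, 'her': 1}
print(own_counts == dict(library_counts))  # True

# Counter can also list the most frequent items directly
print(library_counts.most_common(2))       # [('of', 3), ('a', 2)]
```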

---

#### Exercise 2

Let's put some of the stuff we learnt so far together. What we want you to do is to read into Python the file `data/austen-emma.txt`, convert it to a list of words and assign to the variable `emma_count` how often the word *Emma* occurs in the text.

In [53]:
emma_count = 0
# insert your code here
infile = open('data/austen-emma.txt')
text = infile.read()
infile.close()
words = text.split()
item_to_count = "Emma"
for word in words:
    if word == item_to_count:
        emma_count += 1

# The following test should print True if your code is correct 
print(emma_count == 481)

True


---

### A count function (take 2)

Let's train our function writing skills a little more. We are going to write another counting function, this time using a slightly different strategy. Recall our function `count_in_list`. It takes as arguments the item we want to count and the list in which to count it, and it returns the number of times this item occurs in the list. If we call this function for each unique word in `words`, we obtain a list of frequencies, quite similar to the one we get from the function `counter`. What would happen if we just call the function `count_in_list` on each word in `words`?

In [54]:
infile = open('data/austen-emma-excerpt.txt')
text = infile.read()
infile.close()
words = text.split()

for word in words:
    print(word, count_in_list(word, words))

Emma 2
by 2
Jane 1
Austen 1
1816 1
VOLUME 1
I 2
CHAPTER 1
I 2
Emma 2
Woodhouse, 1
handsome, 1
clever, 1
and 5
rich, 1
with 2
a 4
comfortable 1
home 1
and 5
happy 1
disposition, 1
seemed 1
to 3
unite 1
some 1
of 8
the 4
best 1
blessings 1
of 8
existence; 1
and 5
had 4
lived 1
nearly 1
twenty-one 1
years 1
in 3
the 4
world 1
with 2
very 2
little 2
to 3
distress 1
or 1
vex 1
her. 1
She 1
was 1
the 4
youngest 1
of 8
the 4
two 1
daughters 1
of 8
a 4
most 1
affectionate, 1
indulgent 1
father; 1
and 5
had, 1
in 3
consequence 1
of 8
her 4
sister's 1
marriage, 1
been 2
mistress 1
of 8
his 1
house 1
from 1
a 4
very 2
early 1
period. 1
Her 1
mother 2
had 4
died 1
too 1
long 1
ago 1
for 1
her 4
to 3
have 1
more 1
than 1
an 2
indistinct 1
remembrance 1
of 8
her 4
caresses; 1
and 5
her 4
place 1
had 4
been 2
supplied 1
by 2
an 2
excellent 1
woman 1
as 1
governess, 1
who 1
had 4
fallen 1
little 2
short 1
of 8
a 4
mother 2
in 3
affection. 1


As you can see, we obtain the frequency of each word token in `words`, where we would like to have it only for unique word forms. The challenge is thus to come up with a way to convert our list of words into a structure with solely unique words. For this Python provides a convenient data structure called `set`. It takes as argument some iterable (e.g. a list) and returns a new object containing only unique items:

In [55]:
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)
print(unique_x)

{'c', 'b', 'a'}
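
Two properties of sets are worth knowing, sketched here with a made-up word list. A set has no fixed order, so its items may print in a different order on your machine; `sorted` gives a predictable order when you need one. And `len` applied to a set tells you how many distinct items there are.

```python
# A short made-up word list for illustration
example_words = ["her", "mother", "had", "her", "place", "had", "her"]
unique_example_words = set(example_words)

# Number of words in total versus number of distinct words
print(len(example_words))         # 7
print(len(unique_example_words))  # 4

# sorted returns a list in alphabetical order, whatever order the set has
print(sorted(unique_example_words))  # ['had', 'her', 'mother', 'place']
```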


Using `set` we can iterate over all unique words in our word list and print the corresponding frequency:

In [56]:
unique_words = set(words)
for word in unique_words:
    print(word, count_in_list(word, words))

very 2
by 2
vex 1
early 1
lived 1
VOLUME 1
CHAPTER 1
of 8
died 1
more 1
been 2
had, 1
affection. 1
existence; 1
marriage, 1
affectionate, 1
fallen 1
Emma 2
house 1
his 1
world 1
I 2
who 1
place 1
long 1
a 4
indulgent 1
the 4
period. 1
than 1
best 1
happy 1
clever, 1
1816 1
have 1
daughters 1
indistinct 1
short 1
excellent 1
her. 1
remembrance 1
handsome, 1
unite 1
rich, 1
for 1
and 5
Her 1
mother 2
Austen 1
two 1
caresses; 1
distress 1
governess, 1
an 2
twenty-one 1
too 1
mistress 1
as 1
to 3
years 1
in 3
little 2
blessings 1
seemed 1
father; 1
home 1
or 1
supplied 1
She 1
Jane 1
youngest 1
disposition, 1
consequence 1
her 4
some 1
woman 1
Woodhouse, 1
comfortable 1
was 1
sister's 1
ago 1
from 1
most 1
had 4
with 2
nearly 1


We wrap the lines of code above into the function `counter2`:

In [57]:
def counter2(list_to_search):
    unique_words = set(list_to_search)
    for word in unique_words:
        print(word, count_in_list(word, list_to_search))

A final check to see whether our function behaves correctly:

In [58]:
counter2(words)

very 2
by 2
vex 1
early 1
lived 1
VOLUME 1
CHAPTER 1
of 8
died 1
more 1
been 2
had, 1
affection. 1
existence; 1
marriage, 1
affectionate, 1
fallen 1
Emma 2
house 1
his 1
world 1
I 2
who 1
place 1
long 1
a 4
indulgent 1
the 4
period. 1
than 1
best 1
happy 1
clever, 1
1816 1
have 1
daughters 1
indistinct 1
short 1
excellent 1
her. 1
remembrance 1
handsome, 1
unite 1
rich, 1
for 1
and 5
Her 1
mother 2
Austen 1
two 1
caresses; 1
distress 1
governess, 1
an 2
twenty-one 1
too 1
mistress 1
as 1
to 3
years 1
in 3
little 2
blessings 1
seemed 1
father; 1
home 1
or 1
supplied 1
She 1
Jane 1
youngest 1
disposition, 1
consequence 1
her 4
some 1
woman 1
Woodhouse, 1
comfortable 1
was 1
sister's 1
ago 1
from 1
most 1
had 4
with 2
nearly 1
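
Note that `counter2` prints its results instead of returning them, so the counts cannot be used further on in a program. A possible variant, sketched here under the invented name `counter2_as_dict`, collects the counts in a dictionary and returns it. Keep in mind that this strategy walks through the whole list once for every unique word, so on a long text it will be much slower than `counter`.

```python
def count_in_list(item_to_count, list_to_search):
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

def counter2_as_dict(list_to_search):
    counts = {}
    # Visit every unique word once and store its frequency
    for word in set(list_to_search):
        counts[word] = count_in_list(word, list_to_search)
    return counts

# A short made-up word list for illustration
example_words = ["of", "a", "mother", "of", "her", "of", "a"]
example_counts = counter2_as_dict(example_words)

print(example_counts["of"])  # 3
print(example_counts["a"])   # 2
```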


---

## Text clean up

In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The function `split` is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite a lot of noise. For instance, the pronoun *her* occurs 4 times, but we also find `her.` occurring once and the capitalized `Her`, also once. Of course we would like to add those counts to that of *her*. So although tokenizing our text with `split` is fast and simple, it leaves us with noisy and incorrect frequency distributions.

There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure for splitting our text into words. The second is to clean up our text and pass this clean result to the convenient `split` function. For now we will follow the second path.

Some words in our text are capitalized. To lowercase these words, Python provides the function `lower`. It operates on strings:

In [59]:
x = 'Emma'
x_lower = x.lower()
print(x_lower)

emma


We can apply this function to our complete text to obtain a completely lowercased text, using:

In [60]:
text_lower = text.lower()
print(text_lower)

emma by jane austen 1816

volume i

chapter i


emma woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

she was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.


This solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The function `replace` is just the function we're looking for. It takes two arguments: (1) the string we would like to replace and (2) the string we want to replace the first argument with:

In [61]:
x = 'Please. remove. all. dots. from. this. sentence.'
x = x.replace(".", "")
print(x)

Please remove all dots from this sentence


Notice that we replace all dots with an empty string written as `""`. 

---

#### Exercise 3

Write code that lowercases the following short text and removes all commas from it:

In [62]:
short_text = "Commas, as it turns out, are so much overestimated."
# insert your code here
short_text = short_text.replace(",","")
short_text = short_text.lower()

# The following test should print True if your code is correct 
print(short_text == "commas as it turns out are so much overestimated.")

True


---

We would like to remove all punctuation from a text, not just dots and commas. We will write a function called `remove_punc` that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call `replace` on the same string, each time replacing a different punctuation character with an empty string.

In [63]:
def remove_punc(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc(short_text))

Commas as it turns out are overestimated Dots however even more so


The second strategy achieves the same result without using the built-in function `replace`. Remember that a string consists of characters. We can loop over a string, accessing each character in turn. Each time we find a punctuation marker we skip to the next character; every other character is added to a new, clean string.

In [64]:
def remove_punc2(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    clean_text = ""
    for character in text:
        if character not in punctuation:
            clean_text += character
    return clean_text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc2(short_text))

Commas as it turns out are overestimated Dots however even more so
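
A third option, not used in this course, relies on two pieces of the standard library: the constant `string.punctuation`, which holds the ASCII punctuation characters, and the string method `translate`, which can delete a set of characters in one pass. The function name `remove_punc3` is our own. Treat this as a sketch: like the two functions above, it also strips apostrophes and hyphens inside words.

```python
import string

def remove_punc3(text):
    # Build a table that marks every punctuation character for deletion
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc3(short_text))
# Commas as it turns out are overestimated Dots however even more so

print(remove_punc3("her sister's marriage"))
# her sisters marriage
```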


---

#### Exercise 4

1) Can you come up with any pros or cons for each of the two functions above?

Both functions remove the punctuation from the string and return the same result. A drawback of both is that uppercase letters are left unchanged, so lowercasing still has to be done separately.

2) Now it is time to put everything together. We want to write a function `clean_text` that takes as argument a text represented by string. The function should return this string with all punctuation removed and all characters lowercased.

In [67]:
def clean_text(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    cleaned = ""
    # lowercase the text first, then keep every character that is not punctuation
    for character in text.lower():
        if character not in punctuation:
            cleaned += character
    return cleaned


short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(clean_text(short_text) == 
      "commas as it turns out are overestimated dots however even more so")

True


3) This last exercise puts everything together. We want you to open and read the file `data/austen-emma.txt` once more, clean up the text and recompute the frequency distribution. Assign to `woodhouse_counts` the number of times the name *Woodhouse* occurs in the text.

In [68]:
woodhouse_counts = 0
# insert your code here
infile = open('data/austen-emma.txt')
text = infile.read()
infile.close()

# lowercase the text and remove the punctuation before splitting it into words
words = clean_text(text.lower()).split()

# recompute the frequency distribution and look up the cleaned-up name
frequency_distribution = counter(words)
woodhouse_counts = frequency_distribution["woodhouse"]

# The following test should print True if your code is correct 
print(woodhouse_counts == 263)

True


---

## Writing results to a file

We have accomplished a lot! You have learnt how to read files from your computer using Python, how to manipulate them, clean them up and compute a frequency distribution of the words in a text file. We will finish this chapter by explaining how to write your results to a file. We have already seen how to read a text from our disk. Writing to our disk is only slightly different. The following lines of code write a single sentence to the file `first-output.txt`.

In [69]:
outfile = open("first-output.txt", mode="w")
outfile.write("My first output.")
outfile.close()

Go ahead and open the file `first-output.txt` located in the folder where this course resides. As you can see it contains the line `My first output.`. To write something to a file we open, just as in the case of reading a file, a `TextIOWrapper` which can be seen as a connection to the file `first-output.txt`. The difference with opening a file for reading is the *mode* with which we open the connection. Here the mode says `w`, meaning "open the file for writing". To open a file for reading, we set the mode to `r`. However, since this is Python's default setting, we may omit it.
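
Forgetting to close a file is an easy mistake to make. Python's `with` statement closes the file for you as soon as the indented block ends, for reading and for writing alike. The sketch below writes to a temporary folder so that it does not touch your course files; the file name `with-output.txt` is made up. It also passes an explicit `encoding`, since the default differs between computers (the `cp1252` in the output earlier in this notebook is a common default on Windows).

```python
import os
import tempfile

# A throwaway folder, so the example does not touch the course files
folder = tempfile.mkdtemp()
path = os.path.join(folder, "with-output.txt")

# The file is closed automatically when the indented block ends
with open(path, mode="w", encoding="utf-8") as outfile:
    outfile.write("My first output.")

with open(path, encoding="utf-8") as infile:
    contents = infile.read()

print(contents)       # My first output.
print(infile.closed)  # True
```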

---

#### Exercise 5

In the final exercise of this lab we ask you to write the frequency distribution over the words in `data/austen-emma.txt` to the file `data/austen-frequency-distribution.txt`. We will give you some code to get you started.

In [72]:
# first open and read data/austen-emma.txt. Don't forget to close the infile
infile = open("data/austen-emma.txt")
text = infile.read()
infile.close()

# clean the text: lowercase it, remove the punctuation and split it into words
def remove_punc(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

words = remove_punc(text.lower()).split()

# next compute the frequency distribution using the function counter
def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts

frequency_distribution = counter(words)

# now open the file data/austen-frequency-distribution.txt for writing
outfile = open("data/austen-frequency-distribution.txt", mode="w")

# write one "word;frequency" line for each unique word
for word, frequency in frequency_distribution.items():
    outfile.write(word + ";" + str(frequency) + '\n')

# close the outfile
outfile.close()
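
The loop above writes the words in whatever order the dictionary holds them. If you would rather have the most frequent words at the top of the file, one option is to sort the items by frequency first. The sketch below uses a tiny made-up distribution and only builds the output lines, without writing a file. The built-in `sorted` with its `key` and `reverse` arguments is standard Python; the helper `frequency_of` and the other names are ours.

```python
# A tiny made-up frequency distribution for illustration
example_distribution = {"her": 4, "of": 8, "emma": 2, "and": 5}

def frequency_of(pair):
    # Each pair is a (word, frequency) tuple; we sort on the frequency
    return pair[1]

# Sort the (word, frequency) pairs by frequency, highest first
sorted_items = sorted(example_distribution.items(), key=frequency_of, reverse=True)

lines = []
for word, frequency in sorted_items:
    lines.append(word + ";" + str(frequency))

print("\n".join(lines))
# of;8
# and;5
# her;4
# emma;2
```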


---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>