# Working with plain text files

## Reading in a File

Plain text (.txt) is one of the most common formats you will run into when working with text. There are actually a number of different ways to read one of these files into Python. One of the most basic ways is to use a with statement.

In [1]:
filename = 'corpus/woolf/1915_the_voyage_out.txt'
with open(filename, 'r') as file_in:
    text = file_in.read()
print(text[0:100])

Chapter I


As the streets that lead from the Strand to the Embankment are very
narrow, it is better


For brevity's sake we only print out the first portion of the text. Notice that we open the file and assign it a new, temporary name for the duration of the statement. This ensures that the file is opened, dealt with, and then closed safely. Once we un-indent, the file has been closed, and if we tried to read from the same file object again we would get a ValueError for trying to work with a closed file. The 'as file_in' bit assigns it to a variable so as to help us organize what is happening (we might have another file that we are writing to). Our Woolf novel is now one long string, stored in a text variable. This is fine in certain cases, but we could also read the contents of it in line by line. Here is a variation on the same approach:

In [2]:
filename = 'corpus/woolf/1915_the_voyage_out.txt'
with open(filename, 'r') as file_in:
    text = file_in.readlines()
print(text[0:10])

['Chapter I\n', '\n', '\n', 'As the streets that lead from the Strand to the Embankment are very\n', 'narrow, it is better not to walk down them arm-in-arm. If you persist,\n', "lawyers' clerks will have to make flying leaps into the mud; young lady\n", 'typists will have to fidget behind you. In the streets of London where\n', 'beauty goes unregarded, eccentricity must pay the penalty, and it is\n', 'better not to be very tall, to wear a long blue cloak, or to beat the\n', 'air with your left hand.\n']
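Both cells above rely on the with statement to close the file once the indented block ends. Here is a minimal sketch of what that means in practice. It writes a small throwaway file to a temporary folder (the name `sample.txt` and its contents are invented for illustration, not part of the corpus) and then tries to read from the file object after the block is over. Passing `encoding='utf-8'` is an assumption about how your files are stored; it can be worth stating explicitly because the default encoding varies by platform.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
# 'sample.txt' and its contents are hypothetical stand-ins for a corpus file.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, 'sample.txt')
with open(filename, 'w', encoding='utf-8') as file_out:
    file_out.write('Chapter I\n\nAs the streets that lead from the Strand\n')

# Inside the with block the file is open and can be read.
with open(filename, 'r', encoding='utf-8') as file_in:
    text = file_in.read()

# After un-indenting, the file is closed, but the string we read is still ours.
print(file_in.closed)
print(text[0:9])

# Reading from the closed file object raises a ValueError.
try:
    file_in.read()
    read_failed = False
except ValueError as error:
    read_failed = True
    print('ValueError:', error)
```

The takeaway is that anything you need from the file should be stored in a variable before the block ends.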


The readlines() method allows us to take an open file and read it line by line, returning a list of the lines. We assign that list to the text variable here, which we can now use to examine particular parts of the text. Note that here the line breaks do not correspond to sentences. Dividing longer chunks of text into sentences is a separate technique entirely, one called segmentation, that we'll get into later. For now, though, note how this means that the steps required to process your data depend heavily on the way in which it was encoded in the first place. In some cases, line breaks can be quite useful, like when working with poetry where the line breaks are especially meaningful:

In [3]:
filename = 'corpus/sonnets/sonnet_one.txt'
with open(filename, 'r') as file_in:
    poetry = file_in.readlines()
print(poetry[:12])
print('=====')
print(poetry[12:])

['FROM fairest creatures we desire increase,\n', "That thereby beauty's rose might never die,\n", 'But as the riper should by time decease,\n', 'His tender heir might bear his memory:\n', 'But thou, contracted to thine own bright eyes,\n', "Feed'st thy light'st flame with self-substantial fuel,\n", 'Making a famine where abundance lies,\n', 'Thyself thy foe, to thy sweet self too cruel.\n', "Thou that art now the world's fresh ornament\n", 'And only herald to the gaudy spring,\n', 'Within thine own bud buriest thy content\n', 'And, tender churl, makest waste in niggarding.\n']
=====
['Pity the world, or else this glutton be,\n', "To eat the world's due, by the grave and thee.\n"]


Calling readlines() on a piece of poetry gives us access to the whole poem as a list of lines, so we can manipulate it to separate the poem into pieces that we care about. Above, I separated the poem into two pieces, breaking off the final couplet.
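The same slicing can go further. As a rough sketch, assuming a 14-line sonnet in the Shakespearean form, the twelve lines before the couplet divide into three quatrains. The placeholder lines below are invented stand-ins for what readlines() would return.

```python
# Placeholder lines stand in for a 14-line sonnet read with readlines().
poem = ['line {}\n'.format(n) for n in range(1, 15)]

# Step through the first twelve lines four at a time to get the quatrains.
quatrains = [poem[start:start + 4] for start in range(0, 12, 4)]
# Whatever is left after line twelve is the closing couplet.
couplet = poem[12:]

print(len(quatrains))
print(quatrains[0])
print(couplet)
```

This only works if every poem really has fourteen lines, so it is worth checking `len(poem)` before trusting the pieces.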

Those '\n' characters might appear to be a mistake at first, but worry not! They are actually the computer's representation of a newline character, a way of knowing when a line break happens. Before we analyze these lines, we would want to strip those characters out. One way would be to go through each string and remove the character:

In [4]:
cleaned_poem = []
for line in poetry:
    clean_line = line.replace('\n', '')
    cleaned_poem.append(clean_line)

print(cleaned_poem)

['FROM fairest creatures we desire increase,', "That thereby beauty's rose might never die,", 'But as the riper should by time decease,', 'His tender heir might bear his memory:', 'But thou, contracted to thine own bright eyes,', "Feed'st thy light'st flame with self-substantial fuel,", 'Making a famine where abundance lies,', 'Thyself thy foe, to thy sweet self too cruel.', "Thou that art now the world's fresh ornament", 'And only herald to the gaudy spring,', 'Within thine own bud buriest thy content', 'And, tender churl, makest waste in niggarding.', 'Pity the world, or else this glutton be,', "To eat the world's due, by the grave and thee."]


As always, you can clean and slice up any given text a number of ways. It depends on what you're interested in. But knowing you have options is often the first step.
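Two other options for the same cleanup are sketched below, using a couple of lines from the sonnet so the example stands on its own. The first strips the trailing newline with rstrip() inside a list comprehension; the second starts from one long string and lets splitlines() drop the line endings. Neither is the "right" way, they are just alternatives to the loop above.

```python
# A couple of lines in the shape readlines() returns them (trailing '\n' included).
poetry = [
    'FROM fairest creatures we desire increase,\n',
    "That thereby beauty's rose might never die,\n",
]

# Option 1: strip the trailing newline from each line with a list comprehension.
cleaned_poem = [line.rstrip('\n') for line in poetry]

# Option 2: start from the whole text as one string and let splitlines()
# break it apart; it drops the line endings as it goes.
whole_text = ''.join(poetry)
split_poem = whole_text.splitlines()

print(cleaned_poem)
print(cleaned_poem == split_poem)
```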

## Breaking a text into structural components

The results above point to an important underlying problem in natural language processing: texts are never consistently formatted in such a way that they are computer ready right away. That's what puts the natural in natural language! The format of your inputs will vary a lot depending on your sources. In the case of prose, you can expect a few different categories:

1. The text is one continuous string with no line breaks.
2. The text has line breaks that correspond to the ends of the lines as laid out on a page.
3. The text has line breaks that correspond to meaningful categories.
4. Some combination of 2 and 3 (most likely).
5. The text might have other meaningful markers you could use to your advantage.

In most cases, as when working with prose, the line breaks will be used to shape the legibility of a text. That is, they are meant to assist with the typographical layout, but they have no underlying interpretive meaning. This means that if you wish to preserve the underlying structure of a text you will need to parse the text in a more sophisticated way than just reading it in either as a lump or line by line.
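As one sketch of what "more sophisticated" might look like, suppose a prose file is hard-wrapped for the page and uses a blank line between paragraphs. That layout is an assumption, so check your own file first. We can then split on the blank lines and stitch the wrapped lines of each paragraph back together. The second paragraph in the sample text is invented for illustration.

```python
# Sample text: two paragraphs, hard-wrapped, separated by a blank line.
# The second paragraph is made up for the example.
text = (
    'As the streets that lead from the Strand to the Embankment are very\n'
    'narrow, it is better not to walk down them arm-in-arm.\n'
    '\n'
    'A second paragraph, also wrapped\n'
    'across two lines.\n'
)

paragraphs = []
for block in text.split('\n\n'):
    # Join the wrapped lines of each block back together with single spaces.
    # split() with no arguments breaks on any run of whitespace, newlines included.
    unwrapped = ' '.join(block.split())
    if unwrapped:  # skip blocks that were only whitespace
        paragraphs.append(unwrapped)

print(len(paragraphs))
print(paragraphs[1])
```

One side effect to be aware of: this also collapses any runs of multiple spaces inside a paragraph, which may or may not matter for your analysis.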

If you want to grab more specific structural components, you can theoretically do that as well. But you will need something specific for Python to grab onto. If you are interested in pages as a unit of information and measurement, does your text file actually represent where a page break occurs? If not, unless you wanted to get a rough count by estimating the number of characters per page, you will have difficulty breaking pages apart in this way. If you are interested in structural information it can be worth checking your text files closely. Sometimes there might be unintentional markers that could help you out. For example:

* a stanza break might be represented by two line breaks as opposed to one
* chapter breaks might be represented by a line break followed by a line starting with "CH."
* when OCRing text, a page break might be represented by a single line with the title of the book and/or a page number

Each of these flags can sometimes help you roughly cut a text up in the ways you want. We'll work with the first two for now, starting with a text file that contains a series of sonnets we want to break apart:

In [1]:
# this text file contains a series of sonnets. two line breaks between each sonnet.

with open('corpus/sonnets_combined.txt', 'r') as filein:
    text = filein.read()

# splitting apart by two line breaks, then, can give us 
# access to a list of the sonnets so we could work with them individually.
sonnets = text.split('\n\n')
for sonnet in sonnets:
    print('======')
    print(sonnet)

FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy b
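Splitting on '\n\n' as in the cell above can leave stray pieces behind if the file has extra blank lines, for instance at the very end. A small sketch of how you might filter those out follows; the sample string is invented and much shorter than a real file.

```python
# Invented sample: two short "poems" separated by a blank line, with an
# extra blank line at the end of the file.
text = 'first poem, line one\nfirst poem, line two\n\nsecond poem, line one\n\n\n'

raw_pieces = text.split('\n\n')
# Drop pieces that contain nothing but whitespace, and trim stray newlines.
poems = [piece.strip('\n') for piece in raw_pieces if piece.strip()]

print(len(raw_pieces))
print(len(poems))
```

Comparing the two lengths is a quick way to see whether the split produced any leftovers.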

To break apart by chapters we will use a different library to do the splitting: the re library will give us access to regular expressions. Regular expressions are a complicated way to define a search pattern for a string. They give you a lot of control, but the syntax can be difficult to get a handle on. Below, we say "look at the string we've given you and split it apart whenever you find the word CHAPTER, then a space, then one or more capital letters from A-Z, followed by two \n (newline) characters." Note also that splitting a string in this way throws away the text that the pattern matches, so the chapter headers themselves will no longer be in our text. You may or may not want that to be the case.

In [4]:
import re
# our copy of Night and Day by Woolf lists chapter headers, so we could split on the chapter headers to break the text apart.

with open('corpus/woolf/1919_night_and_day.txt', 'r') as filein:
    text = filein.read()

# split apart by chapter headers; the raw string (r'...') hands the
# backslashes to the regular expression engine unchanged
chapters = re.split(r'CHAPTER [A-Z]+\n\n', text)
len(chapters)

35

This might appear to have done what we wanted, but if you look at the text itself you find that the novel has 34 chapters, not 35. We can check our work by looking at the opening of each chapter (here, the first 99 characters):

In [24]:
for chapter in chapters:
    print('======')
    print(chapter[0:99])


It was a Sunday evening in October, and in common with many other young
ladies of her class, Kathar
The young man shut the door with a sharper slam than any visitor had
used that afternoon, and walke
Denham had accused Katharine Hilbery of belonging to one of the most
distinguished families in Engl
At about nine o'clock at night, on every alternate Wednesday, Miss Mary
Datchet made the same resol
Denham had no conscious intention of following Katharine, but, seeing
her depart, he took his hat a
Of all the hours of an ordinary working week-day, which are the
pleasantest to look forward to and 
"And little Augustus Pelham said to me, 'It's the younger generation
knocking at the door,' and I s
She took her letters up to her room with her, having persuaded her
mother to go to bed directly Mr.
Katharine disliked telling her mother about Cyril's misbehavior quite as
much as her father did, an
Messrs. Grateley and Hooper, the solicitors in whose firm Ralph Denham
was clerk, had their office 

This shows us that the first item in our chapters list has nothing in it. A glance at the text file shows that this is because the text starts right off with a chapter header, so splitting there leaves an empty item at the start of the list. We can cut that empty item out like so:

In [25]:
chapters = chapters[1:]
print(len(chapters))

34
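Recall that splitting this way threw away the chapter headers. If you would rather keep them, one option is to wrap the header part of the pattern in parentheses: re.split() includes the text of any capturing group in the list it returns. The miniature "novel" below is a trimmed stand-in for the real file, and the variable names are only for illustration.

```python
import re

# Trimmed stand-in for a novel that follows the same header convention.
text = (
    'CHAPTER I\n\nIt was a Sunday evening in October.\n\n'
    'CHAPTER II\n\nThe young man shut the door.\n'
)

# Wrapping the header in parentheses makes it a capturing group, and
# re.split() includes captured text in the list it returns.
pieces = re.split(r'(CHAPTER [A-Z]+)\n\n', text)

# pieces[0] is whatever came before the first header (empty here);
# after that, headers and chapter bodies alternate.
headers = pieces[1::2]
bodies = pieces[2::2]
chapters_with_headers = list(zip(headers, bodies))

for header, body in chapters_with_headers:
    print(header, '->', body.strip())
```

Pairing each header with its body this way also makes it easier to spot a chapter that went missing.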


It is always a good idea to check the contents of your variables as you go, to ensure that they match up with your understanding of what the text actually is. It is also a good idea to double check the text itself to ensure that it consistently uses whatever markers you're planning to use. Chapter headers worked in this instance, but what if the text had an afterword? The system would have broken down. Things are rarely simple with this work.
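To see how an afterword would slip through, here is a small sketch with an invented sample text. It uses re.findall() to list every header the pattern actually matches, which you can compare by eye against the table of contents, and then shows that a hypothetical AFTERWORD section ends up glued to the last chapter.

```python
import re

# Invented sample: two chapters followed by an afterword that does not
# use the CHAPTER convention.
text = (
    'CHAPTER I\n\nFirst chapter text.\n\n'
    'CHAPTER II\n\nSecond chapter text.\n\n'
    'AFTERWORD\n\nSome closing remarks.\n'
)

# List every header the pattern actually matches.
headers = re.findall(r'CHAPTER [A-Z]+', text)
print(headers)

# The afterword is not matched, so it stays attached to the last chapter.
chapters = re.split(r'CHAPTER [A-Z]+\n\n', text)[1:]
print('AFTERWORD' in chapters[-1])
```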

## Reading in multiple files to preserve structure

In some cases, you might already have your various texts separated as a series of text files. You can read these into Python in such a way that preserves this structure. In the case of a book of poetry, you might, for example, separate each poem into its own text file. And, as poetry cares about lines as a meaningful unit outside of typography, using the readlines() method might actually get you useful information. You can combine these two methods.

The sonnets folder has a set of five Shakespearean sonnets in it. Combining what we've learned already, we can gather a list of filenames, read each text in, and then store the collection in a variable for manipulation.

In [9]:
import glob

filenames = glob.glob('corpus/sonnets/*.txt')
sonnets = []
for filename in filenames:
    with open(filename, 'r') as file_in:
        sonnets.append(file_in.readlines())

We now have a list called sonnets, but it's more properly understood as a list of lists, or a list in which each item is itself a list of more items:

* List level one: sonnet level.
* List level two (sub-list): line level.

That can be confusing, but the important thing is that we've preserved the structure of our poetry. We have a set of poems, and each of those poems can be further broken down into its individual lines. We can manipulate this hierarchy to access different elements of the list. This will give us the first sonnet in the list.

In [10]:
print(sonnets[0])

['When forty winters shall besiege thy brow,\n', "And dig deep trenches in thy beauty's field,\n", "Thy youth's proud livery so gazed on now,\n", "Will be a totter'd weed of small worth held:\n", 'Then being asked, where all thy beauty lies,\n', 'Where all the treasure of thy lusty days;\n', 'To say, within thine own deep sunken eyes,\n', 'Were an all-eating shame, and thriftless praise.\n', "How much more praise deserv'd thy beauty's use,\n", "If thou couldst answer 'This fair child of mine\n", "Shall sum my count, and make my old excuse,'\n", 'Proving his beauty by succession thine!\n', 'This were to be new made when thou art old,\n', "And see thy blood warm when thou feel'st it cold.\n"]


And this will give us only the first five lines of the third sonnet in the list.

In [11]:
print(sonnets[2][0:5])

['Unthrifty loveliness, why dost thou spend\n', "Upon thy self thy beauty's legacy?\n", "Nature's bequest gives nothing, but doth lend,\n", 'And being frank she lends to those are free:\n', 'Then, beauteous niggard, why dost thou abuse\n']
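One caveat about the loading step: glob.glob() makes no promise about the order of the filenames it returns (it depends on the file system), which is likely why the first item in our list is not the first sonnet. A sketch of one way to make this more predictable is to sort the filenames and keep each filename alongside its lines. The folder and file names below are created on the fly for illustration and are not the ones in the corpus.

```python
import glob
import os
import tempfile

# Build a throwaway folder with three tiny files so the example runs anywhere.
# The file names and contents are invented for illustration.
folder = tempfile.mkdtemp()
for name in ['sonnet_three.txt', 'sonnet_one.txt', 'sonnet_two.txt']:
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as file_out:
        file_out.write('first line of ' + name + '\nsecond line\n')

# sorted() gives a predictable (alphabetical) order.
filenames = sorted(glob.glob(os.path.join(folder, '*.txt')))

# Keep each poem's lines next to the name of the file they came from.
sonnets = []
for filename in filenames:
    with open(filename, 'r', encoding='utf-8') as file_in:
        sonnets.append((os.path.basename(filename), file_in.readlines()))

for name, lines in sonnets:
    print(name, len(lines))
```

Note that alphabetical order is not the same as numerical order: with these names, "three" sorts before "two". If order matters, numbering the files with zero-padded digits (01, 02, and so on) is one way around that.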


Organizing things manually like this gives you great control over the files you're working with and their underlying structure. When working with a large corpus you might not have such an option. Separating files by hand is feasible when working with a few texts, but when working with thousands of documents you either have to work with the files as you received them or develop some computational way of recovering the structure of your text. In these cases, you might rely on textual markers to pinpoint sections of a text as noted above. Each different NLP situation presents its own set of problems. But, if you do this work enough, you'll find that a common set of methods can get you quite far.