# Working with TEI

## Getting Data from TEI

The [Text Encoding Initiative (TEI)](https://tei-c.org/) is one of the longest-standing and most robust digital humanities projects. Thanks to the work of the TEI Consortium, scholars have extensive tools at their disposal for encoding the nuances of textual data systematically, following the TEI guidelines. Once a text has been organized, encoded, and shared, it can be difficult to tell what might come next for it. There are a number of possibilities related to display and archiving, but let's take a look at the analytical affordances of such structured data for natural language processing.

TEI is a flavor of XML, so we can access it with the same tools that we used for HTML in the [scraping chapter](scraping.ipynb). First let's read in some text. We'll be using a TEI-encoded version of Dostoevsky's _The Brothers Karamazov_, [available from Project Gutenberg](http://www.gutenberg.org/browse/authors/d#a314).

In [1]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# store the filename of the text.
filename = 'corpus/brothers_karamazov.tei'

# read in the filename, store it temporarily as a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# take the text, turn it into a BeautifulSoup object, and store in a variable called tei.
tei = BeautifulSoup(text, 'xml')

We now have access to the encoded text and can manipulate it in much the same way that we would an HTML file. It is worth noting that knowing exactly _what_ to query for depends to a large degree on knowledge of your object of study. No two TEI files will be formatted or encoded in exactly the same way, so you will need to closely examine your materials early on. Let's take a look at the first part of the TEI file.

In [2]:
tei.teiHeader

<teiHeader>
<fileDesc>
<titleStmt>
<title>The Brothers Karamazov</title>
<author><name reg="Dostoyevsky, Fyodor">Fyodor Dostoyevsky</name></author>
</titleStmt>
<editionStmt>
<edition n="1">Edition 1</edition>
</editionStmt>
<publicationStmt>
<publisher>Project Gutenberg</publisher>
<date>February 12, 2009</date>
<idno type="etext-no">28054</idno>
<availability>
<p>This eBook is for the use of anyone anywhere at no cost and
        with almost no restrictions whatsoever. You may copy it, give it
        away or re-use it under the terms of the Project Gutenberg
        License online at www.gutenberg.org/license</p>
</availability>
</publicationStmt>
<sourceDesc>
<bibl>
        Created electronically.
      </bibl>
</sourceDesc>
</fileDesc>
<encodingDesc>
</encodingDesc>
<profileDesc>
<langUsage>
<language id="en"/>
<language id="fr"/>
<language id="la"/>
<language id="de"/>
</langUsage>
</profileDesc>
<revisionDesc>
<change>
<date value="2009-02-12">February 12, 2009</date>
<respStmt>

Note that the source material presented tags to us in camel case, with certain letters capitalized. We had to preserve these same conventions when querying the TEI for the teiHeader. Neglecting to do so would return no results, as in the example below:

In [3]:
tei.teiheader

But if we had used a different parser to work with the TEI, a la

BeautifulSoup(text, 'lxml')

instead of

BeautifulSoup(text, 'xml')

the result would have flattened out all our capitalization: the `lxml` parser treats the file as HTML and converts every tag name to lowercase, so preserving the capitalization in our queries would have returned nothing! The bottom line: know your data and know the methods you're using to work with it. Looking at the TEI would tell us that a teiHeader tag exists with data in it, so it must be there somewhere. We just had to query for it in the correct manner.
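If you would rather not depend on how a given parser treats capitalization, one option is to match tag names case-insensitively. The sketch below is only an illustration and not part of the original notebook: the `find_tag_any_case` helper and the tiny `sample` fragment are hypothetical, and it uses Python's built-in `'html.parser'` (which lowercases tag names just as `'lxml'` does) so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# a tiny hypothetical TEI fragment, standing in for the full file
sample = "<TEI><teiHeader><title>Sample</title></teiHeader></TEI>"


def find_tag_any_case(soup, name):
    """Return the first tag whose name matches `name`, ignoring case (hypothetical helper)."""
    wanted = name.lower()
    return soup.find(lambda tag: tag.name.lower() == wanted)


# html.parser ships with Python and lowercases tag names while parsing
soup = BeautifulSoup(sample, 'html.parser')

# the camel-case query succeeds even though the parser stored the tag as 'teiheader'
header = find_tag_any_case(soup, 'teiHeader')
print(header.name)
```

Because the comparison lowercases both sides, the same helper should also find the tag in a tree built with the `'xml'` parser, where the name keeps its original capitalization.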

Using this knowledge, we could work backwards from the TEI tags to build up a workable text. Let's go ahead and re-import the text with a different parser so that we don't have to worry about capitalization:

In [14]:
# read in the filename, store it temporarily as a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# parse the text with the lxml parser this time and store the result in a variable called tei,
# then look for all paragraphs
tei = BeautifulSoup(text, 'lxml')
paragraphs = tei.find_all('p')
paragraphs[101]

<p>
<q>Fyodor Pavlovitch, for the last time, your compact, do you
hear? Behave properly or I will pay you out!</q> Miüsov had time to
mutter again.
</p>

We got a bunch of paragraphs, but note that the TEI header is still captured in what we have so far. We would need to take care to get rid of it or keep it, depending on our goals. Using this same approach we could pull out all the text of those paragraphs, stripping away the tags using the `.text` attribute available on Beautiful Soup tags.

In [15]:
text_of_paragraphs = [paragraph.text for paragraph in paragraphs]
text_of_paragraphs[101]

'\nFyodor Pavlovitch, for the last time, your compact, do you\nhear? Behave properly or I will pay you out! Miüsov had time to\nmutter again.\n'
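Two loose ends are visible at this point: paragraphs from the header are still in the list, and the extracted text keeps the line breaks of the source file. The following is a minimal sketch of one way to handle both, under stated assumptions: the `sample` string is a hypothetical miniature of the file, the header tag is assumed to have been lowercased to `teiheader` by the parser, and `'html.parser'` stands in for `'lxml'` so the example runs on its own.

```python
from bs4 import BeautifulSoup

# hypothetical miniature of the file: one paragraph in the header, one in the text
sample = """
<TEI>
<teiHeader><availability><p>This eBook is for the use of
anyone anywhere.</p></availability></teiHeader>
<text><p>
<q>Behave properly or I will
pay you out!</q> he had time to
mutter again.
</p></text>
</TEI>
"""

soup = BeautifulSoup(sample, 'html.parser')

# remove the header, and everything inside it, from the parsed tree
header = soup.find('teiheader')
if header is not None:
    header.decompose()

# collapse every run of whitespace (including newlines) into a single space
clean_paragraphs = [' '.join(p.get_text().split()) for p in soup.find_all('p')]
print(clean_paragraphs)
```

Note that `decompose()` changes the tree in place, so if you want the header for later use, extract what you need from it first.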

Of course, the whole point of TEI is that we actually care about the ways that the tags are interacting with the text itself. In our previous example, the encoder used the `<q>` tag to mark [quoted material](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-q.html).

In [16]:
paragraphs[101]

<p>
<q>Fyodor Pavlovitch, for the last time, your compact, do you
hear? Behave properly or I will pay you out!</q> Miüsov had time to
mutter again.
</p>

We could use this tag to pull out every piece of text that the encoder marked as quoted, whether it represents speech or a moment of rhetorical distancing:

In [17]:
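# get the text of every quoted passage
quotes = [q.text for q in tei.find_all('q')]
# count the quotes (a notebook only displays the last expression in a cell, so wrap this in print() to see it)
len(quotes)
# print the first nine quotes
print(quotes[0:9])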

['landowner', 'romantic', "One would think that you'd got a promotion, Fyodor Pavlovitch,\nyou seem so pleased in spite of your sorrow,", 'Lord, now lettest Thou Thy\nservant depart in peace,', 'clericals.', 'Those innocent eyes slit my soul up like a razor,', 'from the halter,', 'wronged', 'possessed\nby devils.']
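The output mixes two kinds of quotation: single words held at a distance ('landowner', 'romantic') and longer stretches of speech. A rough way to separate them is by length. This is a sketch only: the three-word threshold `MAX_WORDS` is an arbitrary assumption, and the list reuses the nine quotes printed above rather than the full set.

```python
# the first nine quotes, copied from the output above
quotes = ['landowner', 'romantic', "One would think that you'd got a promotion, Fyodor Pavlovitch,\nyou seem so pleased in spite of your sorrow,", 'Lord, now lettest Thou Thy\nservant depart in peace,', 'clericals.', 'Those innocent eyes slit my soul up like a razor,', 'from the halter,', 'wronged', 'possessed\nby devils.']

# arbitrary cutoff (an assumption): quotes this short are treated as isolated words or phrases
MAX_WORDS = 3

# split() with no arguments breaks on any whitespace, so embedded newlines are handled too
short_quotes = [' '.join(q.split()) for q in quotes if len(q.split()) <= MAX_WORDS]
long_quotes = [' '.join(q.split()) for q in quotes if len(q.split()) > MAX_WORDS]

print(short_quotes)
print(len(long_quotes))
```

A length cutoff is a blunt instrument, so the two lists are best treated as starting points for reading rather than as a classification.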


The number of things that you'll be able to pull out of any particular text depends, ultimately, on the encoding itself. Marking a text up at all, but especially with TEI, is a deeply interpretive act, and any encoding project will have its own interests. You'll want to look closely at your encoded text to get a sense of the options available to you. With the Dostoevsky text we could do many things, but the most basic might involve looking at the attributes of a tag to get a clearer sense of particular pieces of text. We could, for instance, look at the times that parts of the text were tagged as "foreign" to get a sense of the linguistic diversity in the text.

In [19]:
import nltk

# find all instances of language marked as foreign
foreign_text = tei.find_all('foreign')
# check the lang attribute of the first tag, where the encoder has stored information about the language of the text
foreign_text[0].get('lang')
# get the lang attribute of every tag
language_markings = [instance.get('lang') for instance in foreign_text]
# get a set consisting of all the unique language flags
set(language_markings)
# count how often each language flag occurs
nltk.FreqDist(language_markings)

FreqDist({None: 83, 'fr': 54, 'de': 11, 'la': 5})

Only three languages are marked in the text. Among those markings, French phrases far outnumber German or Latin ones. But `None`, which means the tag carries no `lang` attribute at all, is even more frequent. This might be an opportunity to clean up the TEI, as we could go through and assign language categories to those tags manually. This is a good reminder that the results of your text analysis should never be taken for granted. They're the results of human intervention, interpretation, and error all the way down.
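To put those counts in proportion, the short sketch below turns them into percentages. It copies the four numbers from the `FreqDist` output above into a `collections.Counter` (a standard-library stand-in for `FreqDist`, used here so the example runs without NLTK); the label `'unmarked'` is just an illustrative name for the `None` case.

```python
from collections import Counter

# counts copied from the FreqDist output above
counts = Counter({None: 83, 'fr': 54, 'de': 11, 'la': 5})

# total number of <foreign> tags
total = sum(counts.values())

# print each language flag with its count and share of the total
for lang, n in counts.most_common():
    label = lang if lang is not None else 'unmarked'
    print(f'{label:>8}: {n:3d} ({n / total:.1%})')
```

Of the 153 tags, 83 have no language recorded, which is more than half.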

We might use this information to pull out all the text of a particular language:

In [20]:
french_snippets = [instance.text for instance in foreign_text if instance.get('lang') == 'fr']
french_snippets[0:9]

['Il faudrait les\ninventer',
 "J'ai bu l'ombre d'un\ncocher qui avec l'ombre d'une brosse frottait l'ombre d'une carrosse.",
 'un\nchevalier parfait',
 'chevalier',
 'arrière-pensée',
 "coup d'état",
 'poseurs',
 'plus de noblesse que de sincérité',
 'plus de\nsincérité que de noblesse']

In [26]:
unknown_snippets = [instance.text for instance in foreign_text if instance.get('lang') is None]
unknown_snippets[0:30]

['à propos',
 'auto da\nfé',
 'auto da fé',
 'quiproquo',
 'auto da fé',
 'Panie',
 'panie',
 'panie',
 'panie',
 'panie',
 'panie',
 'Pani',
 'Panie!',
 'pan',
 'panovie',
 'panie',
 'pan',
 'panovie!',
 'panovie',
 'panovie',
 'panovie',
 'Panie!',
 'Panovie',
 'panie',
 'panovie',
 'panovie',
 'pani',
 'panie',
 'panie',
 'panie']

Here we go through the text and use the 'lang' attribute to check whether a particular snippet is in the language we care about, French in the first case. In the second case, we use that same attribute to search for instances where the text is tagged as foreign but no language is recorded. These look like a combination of French and Polish, among other things. With this data in hand we could clean up our work or carry out new analyses.
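As one possible shape for that cleanup, the sketch below fills in a `lang` attribute for unmarked tags from a hand-made lookup table. Everything here is an assumption for illustration: the `sample` markup is hypothetical, the `manual_languages` table would in practice come from reading the snippets yourself, and labeling the *pan*/*panie*/*panovie* forms as Polish (`'pl'`) simply follows the guess made above.

```python
from bs4 import BeautifulSoup

# hypothetical fragment: three unmarked <foreign> tags and one marked as French
sample = ('<p><foreign>quiproquo</foreign>, <foreign>panie</foreign>, '
          '<foreign lang="fr">chevalier</foreign>, <foreign>Panovie!</foreign></p>')

soup = BeautifulSoup(sample, 'html.parser')

# hand-made lookup table (an assumption): normalized snippet -> language code
manual_languages = {'pan': 'pl', 'pani': 'pl', 'panie': 'pl', 'panovie': 'pl'}

for tag in soup.find_all('foreign'):
    if tag.get('lang') is None:
        # normalize the snippet: trim whitespace and exclamation marks, ignore case
        key = tag.get_text().strip().strip('!').lower()
        if key in manual_languages:
            tag['lang'] = manual_languages[key]

# collect the language of every tag, and whatever still lacks one
langs = [tag.get('lang') for tag in soup.find_all('foreign')]
still_unmarked = [tag.get_text() for tag in soup.find_all('foreign') if tag.get('lang') is None]
print(langs)
print(still_unmarked)
```

The assignments change only the parsed tree in memory. Anything left in `still_unmarked` is a prompt for a human decision rather than something to resolve automatically.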

## Using TEI to Perform More Nuanced Searches

TEI can allow us to access more nuanced levels of context in a work than might otherwise be available in NLTK. As we've seen elsewhere in this cookbook, when dividing or searching within a text it is necessary to have something to grab onto in your script. The kinds of rich context provided by TEI can be used to make our searches more flexible than they would otherwise be.

The code below finds a specific word in an XML document and returns the word within a specified context from within the work. The size of the contextualized result is ultimately up to you, and will change depending on the type of work you are consulting and the XML/TEI schema used. The example text, Homer's Iliad as translated by Alexander Pope, [available from Project Gutenberg](https://www.gutenberg.org/ebooks/6130), uses paragraph divisions for the introduction and line divisions for the main text. Our inquiry will be within the main text, so the code is written assuming line divisions.

This code performs a function similar to NLTK's concordance feature, in that both return the word of interest within a contextual window. As with NLTK's `text.concordance()`, you can adjust the size of the window to expand or shrink the frame around your word of interest.

NLTK's concordance, however, assumes plain text input. This means that although you may change the size of the window, you are limited to framing that window solely out of individual characters. Although you can retrieve all instances of the word of interest within a specified context, you cannot account for the architecture of the text and return the word within its literary home, be that a line, sentence, or paragraph.

Working with a tagged document allows you to use the document's schema to return a word of interest within its literary environment. If NLTK's concordance function builds its window out of characters, the code below builds its frame through a specified XML tag. The resulting contextual window utilizes the organization of the text, allowing the reader to view the word within a more semantically meaningful environment. This process can help facilitate movement between reading at different scales, as [Ryan Cordell](https://ryancordell.org/research/scale-as-deformance/) and others have theorized.

Running NLTK's concordance function for "Apollo" on the text, once it has been converted into plain text that NLTK can work with, illustrates the limitation. We are unable to distinguish between the introduction and the main text, and the contextual frame consists solely of a specified number of characters on either side of our word.
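For reference, a call to the concordance function might look like the sketch below. It rests on several assumptions: the `sample` sentences are placeholder prose invented for this example rather than lines from Pope's translation, a simple regular expression stands in for `nltk.word_tokenize` so that no NLTK data downloads are needed, and `concordance_list` is assumed to be available, as it is in recent NLTK releases.

```python
import re
from nltk.text import Text

# placeholder prose invented for this sketch, not quoted from Pope's translation
sample = ("The priest walked along the shore and prayed to Apollo for help. "
          "Far away on the mountain Apollo heard the prayer and came down in anger.")

# split into word and punctuation tokens with a simple regular expression
tokens = re.findall(r"\w+|[^\w\s]", sample)

# width is measured in characters; lines caps the number of hits returned
hits = Text(tokens).concordance_list('Apollo', width=50, lines=10)

# each hit carries a ready-to-print line centered on the word of interest
for hit in hits:
    print(hit.line)
```

Each printed line is cut to the requested character width regardless of where sentences or verse lines begin and end.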

The concordance function is useful, but it chops sentences and lines. Depending on the nature of your research, this might make it difficult to get a sense of how Apollo functions within the text, how he is characterized, or the scenes in which he appears. You also might be interested in much larger contexts. By returning a word of interest in a structurally meaningful context, our contextual window is able to better situate that word in a semantically significant environment. The following script, then, can be helpful if one wants to leverage the structure of an XML/TEI tagged document in answering research questions more common to "close reading" strategies.

Where NLTK measures context by character, in what follows we use the TEI markup to shift that context to the line group, represented in TEI by the `<lg>` tag. So if a word of interest occurs in a given line group, the script will return that group together with a specified number of line groups on either side of it. We start by reading in the document and processing the TEI.

In [27]:
# import the Natural Language Toolkit and the Beautiful Soup library
import nltk
from bs4 import BeautifulSoup

# store the text's filepath
filename = 'corpus/iliad.tei'

# read in the filename, store it temporarily as a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# take the text, turn it into a BeautifulSoup object, and store in a variable called tei.
tei = BeautifulSoup(text, 'lxml')

Next, we will store the text divisions we are interested in according to the tags used by our TEI schema:

In [28]:
lgs = tei.find_all('lg')

The next step is to call NLTK to tokenize the content of the tags: 

In [29]:
# make a blank list for the tokenized line groups
line_tokens = []

# loop over the line groups, tokenize their content, append the tokens to the blank list
for lg in lgs:
    tokens = nltk.word_tokenize(lg.text)
    line_tokens.append(tokens)

line_tokens[0]

["''",
 "'Tuque",
 'prior',
 ',',
 'tu',
 'parce',
 'genus',
 'qui',
 'ducis',
 'Olympo',
 ',',
 'Projice',
 'tela',
 'manu',
 'sanguis',
 'meus',
 "'"]

In [30]:
word_of_interest = 'Apollo'

Next, we store our context parameter and loop over the token list:

In [31]:
# make a blank list to store results
contexts_of_word_of_interest = []

# store context parameter: the number of line groups to keep on either side of a match
context = 1

# loop over the line_tokens list, keeping track of the index of each line group
for num, line in enumerate(line_tokens):
    if word_of_interest in line:
        # start the window `context` groups back, but never before the beginning of the list
        # (a negative start would wrap around to the end of the list and return the wrong slice)
        start = max(num - context, 0)
        # add one to the top because Python slicing is exclusive on the top end and inclusive on the bottom.
        end = num + context + 1
        # append the window of line groups surrounding the match
        contexts_of_word_of_interest.append(line_tokens[start:end])

We now have nested lists, which can be confusing. Each item in `contexts_of_word_of_interest` represents one context, that is, one instance in which the word 'Apollo' shows up in our text. Each context in turn holds a series of line groups, and each line group is a list of tokens. So the center of the context contains the line group in which "Apollo" occurs, and we also get a window of the text with a specified number of line groups on either side. For ease of reading, the loop below prints '======' before each context to divide the output.

In [32]:
for context_window in contexts_of_word_of_interest:
    print('======')
    print(context_window)

[['Achilles', "'", 'wrath', ',', 'to', 'Greece', 'the', 'direful', 'spring', 'Of', 'woes', 'unnumber', "'d", ',', 'heavenly', 'goddess', ',', 'sing', '!', 'That', 'wrath', 'which', 'hurl', "'d", 'to', 'Pluto', "'s", 'gloomy', 'reign', 'The', 'souls', 'of', 'mighty', 'chiefs', 'untimely', 'slain', ';', 'Whose', 'limbs', 'unburied', 'on', 'the', 'naked', 'shore', ',', 'Devouring', 'dogs', 'and', 'hungry', 'vultures', 'tore.Vultures', ':', 'Pope', 'is', 'more', 'accurate', 'than', 'the', 'poet', 'he', 'translates', ',', 'for', 'Homer', 'writes', '``', 'a', 'prey', 'to', 'dogs', 'and', 'to', 'all', 'kinds', 'of', 'birds', '.', 'But', 'all', 'kinds', 'of', 'birds', 'are', 'not', 'carnivorous', '.', 'Since', 'great', 'Achilles', 'and', 'Atrides', 'strove', ',', 'Such', 'was', 'the', 'sovereign', 'doom', ',', 'and', 'such', 'the', 'will', 'of', 'Jove', '!', '—i.e', '.', 'during', 'the', 'whole', 'time', 'of', 'their', 'striving', 'the', 'will', 'of', 'Jove', 'was', 'being', 'gradually', 'acco
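Raw token lists are hard to read, so it can help to wrap the windowing logic in a function and join the tokens back into text before printing. The version below is a sketch rather than the notebook's own code: `contexts_for` is a hypothetical helper, `sample_groups` is a small invented stand-in for `line_tokens`, and joining tokens with spaces is a simplification that leaves stray spaces around punctuation.

```python
def contexts_for(word, groups, context=1):
    """Return each group containing `word`, plus `context` neighboring groups on either side."""
    windows = []
    for num, group in enumerate(groups):
        if word in group:
            # clamp the start so a match near the beginning does not produce a negative index
            start = max(num - context, 0)
            # slicing is exclusive at the top, hence the extra 1
            end = num + context + 1
            windows.append(groups[start:end])
    return windows


# small invented stand-in for line_tokens
sample_groups = [
    ['Apollo', 'heard'],
    ['the', 'ships'],
    ['the', 'shore'],
    ['to', 'Apollo', 'prayed'],
]

windows = contexts_for('Apollo', sample_groups)

# print each context with a divider, one group per line
for window in windows:
    print('======')
    for group in window:
        print(' '.join(group))
```

Matches in the first or last group simply get a shorter window, since there are no groups beyond the edges of the list.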

Something like this can be useful for pulling apart the text into smaller units for analysis, using the power of TEI to get a closer look at elements of the text than might otherwise be legible to Python. It can also be useful for facilitating levels of reading that don't easily correspond to structural elements already represented in the text. We covered in other chapters how you might divide a text into equal units or chapters based on chapter headings. Methods like this can help you get a rough cut of "passages that might be relevant to the study of the character Apollo" by using the affordances of TEI.