Like chapter 10, this chapter is heavy on text and ideas. It focuses on managing and maintaining linguistic data, or corpora.

While this notebook includes code cells for examples of data format conversions, data extraction, and other tasks involved in managing linguistic data, it is recommended to go through the chapter in the book to gain a better understanding of the major ideas involved.

https://www.nltk.org/book/ch11.html

# Managing Linguistic Data

## Corpus Structure: A Case Study

NLTK includes a sample from the TIMIT corpus. You can access its documentation in the usual way, using help(nltk.corpus.timit). Print nltk.corpus.timit.fileids() to see a list of the 160 recorded utterances in the corpus sample.

In [5]:
import nltk

In [3]:
nltk.corpus.timit.fileids()[:10]

['dr1-fvmh0/sa1.phn',
 'dr1-fvmh0/sa1.txt',
 'dr1-fvmh0/sa1.wav',
 'dr1-fvmh0/sa1.wrd',
 'dr1-fvmh0/sa2.phn',
 'dr1-fvmh0/sa2.txt',
 'dr1-fvmh0/sa2.wav',
 'dr1-fvmh0/sa2.wrd',
 'dr1-fvmh0/si1466.phn',
 'dr1-fvmh0/si1466.txt']

Each item has a phonetic transcription which can be accessed using the phones() method.

In [7]:
phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
print(phonetic)

['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'y', 'ix', 'dcl', 'd', 'aa', 'kcl', 's', 'ux', 'tcl', 'en', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax', 'q', 'ao', 'l', 'y', 'ih', 'ax', 'h#']


In [8]:
nltk.corpus.timit.word_times('dr1-fvmh0/sa1')

[('she', 7812, 10610),
 ('had', 10610, 14496),
 ('your', 14496, 15791),
 ('dark', 15791, 20720),
 ('suit', 20720, 25647),
 ('in', 25647, 26906),
 ('greasy', 26906, 32668),
 ('wash', 32668, 37890),
 ('water', 38531, 42417),
 ('all', 43091, 46052),
 ('year', 46052, 50522)]

In addition to this text data, TIMIT includes a lexicon that provides the canonical pronunciation of every word, which can be compared with a particular utterance:

In [9]:
timitdict = nltk.corpus.timit.transcription_dict()
timitdict['greasy'] + timitdict['wash'] + timitdict['water']

['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']

In [10]:
phonetic[17:30]

['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']
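To make the comparison between the canonical and the spoken form explicit, one option is to align the two phone sequences with the standard library's `difflib`. The sketch below is an illustration rather than part of the NLTK API: the helper name `phone_diffs` is hypothetical, the two lists are copied from the outputs above, and stress digits such as the `1` in `iy1` are stripped on the assumption that they only mark stress.

```python
import re
from difflib import SequenceMatcher

# Copied from the two outputs above.
canonical = ['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']
spoken = ['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

def phone_diffs(canonical, spoken):
    """Return (operation, canonical_phones, spoken_phones) for each mismatch."""
    # Assumption: digits only encode stress, so drop them before aligning.
    plain = [re.sub(r'\d', '', phone) for phone in canonical]
    matcher = SequenceMatcher(None, plain, spoken, autojunk=False)
    return [(op, plain[i1:i2], spoken[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != 'equal']

for op, expected, actual in phone_diffs(canonical, spoken):
    print(op, expected, '->', actual)
```

Each reported tuple is a place where this speaker departs from the dictionary form, such as a different vowel or an inserted `epi` segment.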

In [11]:
# Accessing demographic data of a particular item
nltk.corpus.timit.spkrinfo('dr1-fvmh0')

SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')

## Acquiring Data

### Obtaining Data from Word Processor Files
  
Consider the following fragment of a lexical entry: "sleep [sli:p] v.i. a condition of body and mind ...". We can enter this in MS Word, choose "Save as Web Page", and then inspect the resulting HTML file:

```
<p class=MsoNormal>sleep
  <span style='mso-spacerun:yes'> </span>
  [<span class=SpellE>sli:p</span>]
  <span style='mso-spacerun:yes'> </span>
  <b><span style='font-size:11.0pt'>v.i.</span></b>
  <span style='mso-spacerun:yes'> </span>
  <i>a condition of body and mind ...<o:p></o:p></i>
</p>
```

In [27]:
import re

In [17]:
legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
document = open("data/dict.htm", encoding="windows-1252").read()
document

"<p class=MsoNormal>sleep\n  <span style='mso-spacerun:yes'> </span>\n  [<span class=SpellE>sli:p</span>]\n  <span style='mso-spacerun:yes'> </span>\n  <b><span style='font-size:11.0pt'>v.i.</span></b>\n  <span style='mso-spacerun:yes'> </span>\n  <i>a condition of body and mind ...<o:p></o:p></i>\n</p>"

We can collect the part-of-speech tags that are actually used in the document as follows:

In [18]:
used_pos = set(re.findall(pattern, document))

In [19]:
used_pos

{'v.i.'}
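With the set of tags in hand, a natural follow-up is to compare it against `legal_pos`, which is defined above but not yet used. The sketch below is self-contained: the `sample` string is a made-up stand-in for `data/dict.htm`, and the second tag in it (`v.intr`) is a hypothetical error added so that the check has something to report.

```python
import re

legal_pos = {'n', 'v.t.', 'v.i.', 'adj', 'det'}
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")

# Hypothetical sample: one legal tag and one tag that is not in legal_pos.
sample = (
    "<b><span style='font-size:11.0pt'>v.i.</span></b>"
    "<b><span style='font-size:11.0pt'>v.intr</span></b>"
)

used_pos = set(pattern.findall(sample))
illegal_pos = used_pos - legal_pos   # tags used in the file but not allowed
print(sorted(illegal_pos))
```

An empty result would suggest the part-of-speech field is consistently formatted; anything else points to entries that need manual correction.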

Once we know the data is correctly formatted, we can write other programs to convert the data into a different format.
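As a rough illustration of such a conversion, the sketch below turns the HTML entry shown earlier into a CSV row using only the standard library. It is a simplification under stated assumptions: tags are stripped with a crude regular expression rather than a real HTML parser, every entry is assumed to sit in its own `<p>` element, and the names `SEP` and `lexical_rows` are hypothetical.

```python
import csv
import io
import re

SEP = '_ENTRY'  # hypothetical marker used to split the text into entries

def lexical_rows(html):
    """Yield [headword, pronunciation, pos, definition] for each <p> entry."""
    html = re.sub(r'<p\b', SEP + '<p', html)   # mark the start of each entry
    text = re.sub(r'<[^>]+>', '', html)        # crude tag stripping
    text = ' '.join(text.split())              # normalise whitespace
    for entry in text.split(SEP):
        entry = entry.strip()
        if entry.count(' ') > 2:               # need at least four fields
            yield entry.split(' ', 3)

html = (
    "<p class=MsoNormal>sleep\n"
    "  <span style='mso-spacerun:yes'> </span>\n"
    "  [<span class=SpellE>sli:p</span>]\n"
    "  <span style='mso-spacerun:yes'> </span>\n"
    "  <b><span style='font-size:11.0pt'>v.i.</span></b>\n"
    "  <span style='mso-spacerun:yes'> </span>\n"
    "  <i>a condition of body and mind ...<o:p></o:p></i>\n"
    "</p>"
)

rows = list(lexical_rows(html))
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)
print(buffer.getvalue())
```

For real Word output a proper HTML parser is the safer choice, since regular expressions can be confused by attributes that contain `>`.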

### Obtaining Data from Spreadsheets and Databases

Our program might perform a linguistically motivated query which cannot be expressed in SQL, e.g. select all words that appear in example sentences for which no dictionary entry is provided. For this task, we would need to extract enough information from a record for it to be uniquely identified, along with the headwords and example sentences. Let's suppose this information was now available in a CSV file dict.csv:

```
"sleep","sli:p","v.i","a condition of body and mind ..."
"walk","wo:k","v.intr","progress by lifting and setting down each foot ..."
"wake","weik","intrans","cease to sleep"
```

We can express this query in code as follows:

In [20]:
import csv

In [28]:
lexicon = csv.reader(open('data/dict.csv'))

In [29]:
pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]

In [33]:
lexemes, defns = zip(*pairs)
defn_words = set(w for defn in defns for w in defn.split())
print(defn_words)

{'each', 'down', 'mind', 'and', 'body', 'setting', 'cease', 'lifting', 'condition', 'foot', 'by', '...', 'to', 'progress', 'sleep', 'a', 'of'}


In [35]:
print(sorted(defn_words.difference(lexemes)))

['...', 'a', 'and', 'body', 'by', 'cease', 'condition', 'down', 'each', 'foot', 'lifting', 'mind', 'of', 'progress', 'setting', 'to']


We can use this information to enrich the lexicon.

### Converting Data Formats

Suppose we need to build an inverted index from some input data.

Continuing with the data from the previous example, we can construct an index that maps each word of a dictionary definition to the lexemes whose definitions contain it.

In [36]:
idx = nltk.Index((defn_word, lexeme) 
                 for (lexeme, defn) in pairs 
                 for defn_word in nltk.word_tokenize(defn) 
                 if len(defn_word) > 3)

In [37]:
idx

Index(list,
      {'condition': ['sleep'],
       'body': ['sleep'],
       'mind': ['sleep'],
       'progress': ['walk'],
       'lifting': ['walk'],
       'setting': ['walk'],
       'down': ['walk'],
       'each': ['walk'],
       'foot': ['walk'],
       'cease': ['wake'],
       'sleep': ['wake']})

In [40]:
with open("output/dict.idx", "w") as idx_file:
    for word in sorted(idx):
        idx_words = ', '.join(idx[word])
        idx_line = "{}: {}".format(word, idx_words)
        print(idx_line, file=idx_file)
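The output of `idx` above suggests that `nltk.Index` behaves like a dictionary of lists. Under that assumption, the same inverted index can be sketched with `collections.defaultdict`. The function name `build_index` is hypothetical, and the definitions are split on whitespace instead of using `nltk.word_tokenize`, which is adequate for this small sample but not in general.

```python
from collections import defaultdict

# The (lexeme, definition) pairs from dict.csv shown earlier.
pairs = [
    ('sleep', 'a condition of body and mind ...'),
    ('walk', 'progress by lifting and setting down each foot ...'),
    ('wake', 'cease to sleep'),
]

def build_index(pairs, min_length=4):
    """Map each definition word of at least min_length letters to its lexemes."""
    index = defaultdict(list)
    for lexeme, defn in pairs:
        for word in defn.split():   # assumption: whitespace splitting suffices
            if len(word) >= min_length:
                index[word].append(lexeme)
    return index

index = build_index(pairs)
lines = ["{}: {}".format(word, ', '.join(index[word])) for word in sorted(index)]
print('\n'.join(lines))
```

The `lines` list holds the same `word: lexeme, ...` format that the cell above writes to `output/dict.idx`.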

### Lookup by Pronunciation Similarity

First, we identify confusable letter sequences and map complex versions to simpler versions.

In [41]:
mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
             ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

In [42]:
def signature(word):
    for patt, repl in mappings: 
        word = re.sub(patt, repl, word)
    pieces = re.findall("[^aeiou]+", word)
    return "".join(char for piece in pieces for char in sorted(piece))[:8]

In [43]:
signature("illefent")

'lfnt'

In [44]:
signature("elephant")

'lfnt'

In [45]:
signature("ebsekwieous")

'bskws'

In [46]:
signature("nuculerr")

'nclr'
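To see why two different spellings end up with the same signature, it can help to look at the word after each substitution. The sketch below reuses the `mappings` list from above; the helper `signature_steps` is a hypothetical addition for inspection only.

```python
import re

mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
            ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

def signature_steps(word):
    """Return (pattern, word) pairs showing the word after each mapping."""
    steps = []
    for patt, repl in mappings:
        word = re.sub(patt, repl, word)
        steps.append((patt, word))
    return steps

for w in ('elephant', 'illefent'):
    print(w, [form for _, form in signature_steps(w)])
```

Both spellings reach the same intermediate form once vowel runs are collapsed and doubled letters are reduced, so removing the vowels leaves an identical consonant skeleton.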

Next, we create a mapping from signatures to words, for all words in our lexicon.

In [47]:
signatures = nltk.Index((signature(w), w) for w in nltk.corpus.words.words())

In [49]:
print(signatures[signature("nuclerr")])

['anicular', 'inocular', 'nucellar', 'nuclear', 'unicolor', 'uniocular', 'unocular']


Finally, we rank the results in terms of similarity to the original word.

In [51]:
def rank(word, wordlist):
    ranked = sorted((nltk.edit_distance(word, w), w) for w in wordlist)
    return [word for (_, word) in ranked]

In [52]:
def fuzzy_spell(word):
    sig = signature(word)
    if sig in signatures:
        return rank(word, signatures[sig])
    return []

In [53]:
fuzzy_spell("illefent")

['olefiant', 'elephant', 'oliphant', 'elephanta']

In [54]:
fuzzy_spell("nuclar")

['nuclear',
 'nucellar',
 'anicular',
 'inocular',
 'unocular',
 'unicolor',
 'uniocular']

In [59]:
fuzzy_spell("eple")[0]

'apple'
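The ranking step depends on edit distance. As a sketch of what that measure computes, here is a plain-Python Levenshtein distance with unit costs for insertion, deletion and substitution. This is an illustrative reimplementation, not NLTK's code, and it assumes the default settings of `nltk.edit_distance` use the same unit costs.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))   # distances from '' to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def rank(word, candidates):
    """Sort candidates by distance to word, breaking ties alphabetically."""
    return [w for _, w in sorted((edit_distance(word, w), w) for w in candidates)]

print(rank('nuclar', ['unicolor', 'nucellar', 'nuclear']))
```

Because the signature lookup has already narrowed the candidates to a handful of words, this quadratic-time comparison only runs on a short list.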

In the above example, a simple program can facilitate access to lexical data in a context where the writing system of a language may not be standardized, or where users of the language may not have a good command of spellings.

## Working with XML

### The ElementTree Interface

Let's illustrate the use of ElementTree using a collection of Shakespeare plays that have been formatted using XML.

In [6]:
# Loading the xml file
merchant_file = nltk.data.find("corpora/shakespeare/merchant.xml")
raw = open(merchant_file).read()

In [5]:
# Inspecting contents
print(raw[:163])

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->

<PLAY>
<TITLE>The Merchant of Venice</TITLE>


In [6]:
print(raw[1789:2006])

<TITLE>ACT I</TITLE>

<SCENE><TITLE>SCENE I.  Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>

<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>


As you can see above, we have accessed the XML data as a string. The next step is to process the file contents as structured XML data using `ElementTree`.

In [2]:
from xml.etree.ElementTree import ElementTree

In [7]:
merchant = ElementTree().parse(merchant_file)
merchant

<Element 'PLAY' at 0x1a175d3c50>

In [9]:
merchant[0]

<Element 'TITLE' at 0x1a219b5650>

In [10]:
merchant[0].text

'The Merchant of Venice'

In [11]:
# getchildren() was removed in Python 3.9; list(element) gives the same child list
list(merchant)

[<Element 'TITLE' at 0x1a219b5650>,
 <Element 'PERSONAE' at 0x1a219b56b0>,
 <Element 'SCNDESCR' at 0x1a219b9350>,
 <Element 'PLAYSUBT' at 0x1a219b93b0>,
 <Element 'ACT' at 0x1a219b9410>,
 <Element 'ACT' at 0x1a219e0dd0>,
 <Element 'ACT' at 0x1a21a16230>,
 <Element 'ACT' at 0x1a21a4c230>,
 <Element 'ACT' at 0x1a21a73e90>]

The play consists of a title, the personae, a scene description, a subtitle, and five acts. Each act has a title and some scenes, and each scene consists of speeches which are made up of lines, a structure with four levels of nesting. Let's dig down into Act IV:

In [12]:
merchant[-2][0].text

'ACT IV'

In [14]:
merchant[-2][1]

<Element 'SCENE' at 0x1a21a4c2f0>

In [15]:
merchant[-2][1][0].text

'SCENE I.  Venice. A court of justice.'

In [16]:
merchant[-2][1][54]

<Element 'SPEECH' at 0x1a21a59d70>

In [17]:
merchant[-2][1][54][0]

<Element 'SPEAKER' at 0x1a21a59dd0>

In [18]:
merchant[-2][1][54][0].text

'PORTIA'

In [19]:
merchant[-2][1][54][1]

<Element 'LINE' at 0x1a21a59e30>

In [20]:
merchant[-2][1][54][1].text

"The quality of mercy is not strain'd,"

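Index chains such as `merchant[-2][1][54][1]` are hard to guess without knowing the layout of the tree. One way to discover the indexes is to print an outline of child tags. The sketch below does this on a made-up miniature play that reuses the tag names of `merchant.xml`; the helper `outline` and the sample text are hypothetical.

```python
from xml.etree import ElementTree as ET

# A made-up miniature play using the same tag names as merchant.xml.
sample = """
<PLAY>
  <TITLE>A Tiny Play</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I.</TITLE>
      <SPEECH>
        <SPEAKER>FIRST</SPEAKER>
        <LINE>Hello there.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""

def outline(element, depth=0, max_depth=2):
    """List child indexes and tags, indented by nesting level."""
    lines = []
    for index, child in enumerate(element):
        lines.append("{}[{}] {}".format("  " * depth, index, child.tag))
        if depth < max_depth:
            lines.extend(outline(child, depth + 1, max_depth))
    return lines

play = ET.fromstring(sample.strip())
print("\n".join(outline(play)))
print(play[1][1][1][1].text)   # act -> scene -> speech -> line
```

Reading the indexes off the outline gives the chain needed to reach any element, which is how an expression like `play[1][1][1][1]` can be assembled step by step.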
We can iterate over elements of a particular type (e.g., the acts) using `merchant.findall('ACT')`. Below is an example of doing tag-specific searches at every level of nesting:

In [22]:
for i, act in enumerate(merchant.findall('ACT')):
    for j, scene in enumerate(act.findall('SCENE')):
        for k, speech in enumerate(scene.findall('SPEECH')):
            for line in speech.findall('LINE'):
                if 'music' in str(line.text):
                    print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))

Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.


Instead of explicitly navigating down the hierarchy, we can also search for embedded elements. 
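One way to do this without spelling out a path is `Element.iter()`, which visits matching elements at any depth. The sketch below runs on a made-up two-speech play rather than on `merchant.xml`; the helper name `lines_containing` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Made-up sample using the same tag names as the Shakespeare files.
sample = """
<PLAY>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>FIRST</SPEAKER>
        <LINE>Let music sound.</LINE>
        <LINE>And then be silent.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>SECOND</SPEAKER>
        <LINE>I hear no music.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""

def lines_containing(root, word):
    """Return the text of every LINE element, at any depth, that contains word."""
    return [line.text for line in root.iter('LINE')
            if line.text and word in line.text]

play = ET.fromstring(sample.strip())
print(lines_containing(play, 'music'))
```

The trade-off is that `iter()` does not report where a match sits in the hierarchy, so the nested loops remain useful when act, scene and speech numbers are needed.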

The example below examines the sequence of speakers to build a frequency distribution for speakers:

In [8]:
from collections import Counter
speaker_seq = [s.text for s in merchant.findall("ACT/SCENE/SPEECH/SPEAKER")]

In [10]:
speaker_seq[:5]

['ANTONIO', 'SALARINO', 'SALANIO', 'SALARINO', 'ANTONIO']

In [11]:
speaker_freq = Counter(speaker_seq)
top5 = speaker_freq.most_common(5)
top5

[('PORTIA', 117),
 ('SHYLOCK', 79),
 ('BASSANIO', 73),
 ('GRATIANO', 48),
 ('ANTONIO', 47)]

We can also look for patterns in who follows whom in the dialogue. We'll only do this for the top 5 speakers, grouping everyone else under `OTH`.

In [12]:
from collections import defaultdict
abbreviate = defaultdict(lambda: 'OTH') # OTH = Other than top 5
for speaker, _ in top5:
    abbreviate[speaker] = speaker[:4]

In [13]:
speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
cfd.tabulate()

     ANTO BASS GRAT  OTH PORT SHYL 
ANTO    0   11    4   11    9   12 
BASS   10    0   11   10   26   16 
GRAT    6    8    0   19    9    5 
 OTH    8   16   18  153   52   25 
PORT    7   23   13   53    0   21 
SHYL   15   15    2   26   21    0 
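In the table, each row is the previous speaker and each column is the speaker who comes next. The same counts can be sketched with the standard library by pairing every speaker with their successor. The example below uses only the first five speakers shown earlier, and the variable name `followers` is a hypothetical choice.

```python
from collections import Counter, defaultdict

# The first five speakers from speaker_seq above.
speaker_seq = ['ANTONIO', 'SALARINO', 'SALANIO', 'SALARINO', 'ANTONIO']

# followers[previous][next] counts how often `next` speaks right after `previous`.
followers = defaultdict(Counter)
for previous, following in zip(speaker_seq, speaker_seq[1:]):
    followers[previous][following] += 1

for previous in sorted(followers):
    print(previous, dict(followers[previous]))
```

The zeros on the diagonal of the full table are expected, since consecutive lines by one speaker belong to a single `SPEECH` element.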


### Using ElementTree for Accessing Toolbox Data

We can use the `toolbox.xml()` method to access a Toolbox file and load it into an `ElementTree` object. This file contains a lexicon for the Rotokas language of Papua New Guinea.

In [14]:
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')
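To give a feel for what such a conversion involves, the sketch below parses a few backslash-marked lines into a tree of records and fields. It is a simplified illustration, not NLTK's implementation: the function `sfm_to_tree`, the root tag name, and the rule that every `\lx` line starts a new record are assumptions, and continuation lines are ignored. The sample fields are taken from entries shown later in this notebook.

```python
from xml.etree.ElementTree import Element, SubElement

sfm = r"""
\lx kaa
\ps N
\ge cooking banana

\lx kaeviro
\ps V
\ge lift off
\ge take off
"""

def sfm_to_tree(text, record_marker='lx'):
    """Build <record> elements whose children are named after the markers."""
    root = Element('toolbox_data')   # root tag name chosen for this sketch
    record = None
    for line in text.splitlines():
        if not line.startswith('\\'):
            continue                 # skip blank and continuation lines
        marker, _, value = line[1:].partition(' ')
        if marker == record_marker or record is None:
            record = SubElement(root, 'record')
        SubElement(record, marker).text = value.strip()
    return root

tree = sfm_to_tree(sfm)
print([lx.text for lx in tree.findall('record/lx')])
```

Repeated markers such as the two `\ge` lines simply become sibling elements, which is why field order and repetition can be examined with ordinary `ElementTree` calls.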

There are two ways to access the contents of the lexicon object, by indexes and by paths. Indexes use the familiar syntax, thus lexicon[3] returns entry number 3.

In [15]:
lexicon[3][0]

<Element 'lx' at 0x1a176cddd0>

In [16]:
lexicon[3][0].tag

'lx'

In [17]:
lexicon[3][0].text

'kaa'

The second way to access the contents of the lexicon object uses paths. The lexicon is a series of record objects, each containing a series of field objects, such as lx and ps. We can conveniently address all of the lexemes using the path record/lx.

In [19]:
[lexeme.text.lower() for lexeme in lexicon.findall('record/lx')][:10]

['kaa',
 'kaa',
 'kaa',
 'kaakaaro',
 'kaakaaviko',
 'kaakaavo',
 'kaakaoko',
 'kaakasi',
 'kaakau',
 'kaakauko']

Let's view the Toolbox data in XML format. 

In [20]:
import sys
from nltk.util import elementtree_indent
from xml.etree.ElementTree import ElementTree
elementtree_indent(lexicon)
tree = ElementTree(lexicon[3])
tree.write(sys.stdout, encoding="unicode")

<record>
    <lx>kaa</lx>
    <ps>N</ps>
    <pt>MASC</pt>
    <cl>isi</cl>
    <ge>cooking banana</ge>
    <tkp>banana bilong kukim</tkp>
    <pt>itoo</pt>
    <sf>FLORA</sf>
    <dt>12/Aug/2005</dt>
    <ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
    <xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
    <xe>Taeavi planted banana in order to cook it.</xe>
  </record>

## Working with Toolbox Data

In [22]:
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')

Computing the average number of fields for each entry:

In [23]:
sum(len(entry) for entry in lexicon) / len(lexicon)

13.635955056179775

In this section we will discuss two tasks that arise in the context of documentary linguistics, neither of which is supported by the Toolbox software.



### Adding a Field to Each Entry

It is often convenient to add new fields that are derived automatically from existing ones. Such fields often facilitate search and analysis. 

For instance, below we define a function `cv()` which maps a string of consonants and vowels to the corresponding CV sequence, e.g. `kakapua` would map to `CVCVCVV`.

In [24]:
from xml.etree.ElementTree import SubElement

def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]',     r'_', s)
    s = re.sub(r'[aeiou]',    r'V', s)
    s = re.sub(r'[^V_]',      r'C', s)
    return s

def add_cv_field(entry):
    for field in entry:
        if field.tag == 'lx':
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)

In [30]:
lexicon = toolbox.xml('rotokas.dic')
add_cv_field(lexicon[53])

In [31]:
print(nltk.toolbox.to_sfm_string(lexicon[53]))

\lx kaeviro
\ps V
\pt A
\ge lift off
\ge take off
\tkp go antap
\sc MOTION
\vx 1
\nt used to describe action of plane
\dt 03/Jun/2005
\ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
\xp Pita i go antap na lukim haus win i bagarapim.
\xe Peter went to look at the house that the wind destroyed.
\cv CVVCVCV
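The same idea can be tried without the Rotokas corpus by building a small entry by hand. The sketch below repeats the `cv()` function and applies it to a hand-made record; the record contents are copied from the entry above, and iterating over `list(entry)` is a defensive choice made here so that the loop does not run over an element list that is being extended.

```python
import re
from xml.etree.ElementTree import Element, SubElement, tostring

def cv(s):
    """Map a word to its consonant/vowel skeleton, e.g. 'kakapua' -> 'CVCVCVV'."""
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)    # anything that is not a letter
    s = re.sub(r'[aeiou]', r'V', s)   # vowels
    s = re.sub(r'[^V_]', r'C', s)     # every remaining letter is a consonant
    return s

def add_cv_field(entry):
    for field in list(entry):         # copy, since we append inside the loop
        if field.tag == 'lx':
            SubElement(entry, 'cv').text = cv(field.text)

# Hand-made record with two fields from the entry shown above.
entry = Element('record')
SubElement(entry, 'lx').text = 'kaeviro'
SubElement(entry, 'ps').text = 'V'

add_cv_field(entry)
print(tostring(entry, encoding='unicode'))
```

Once the `cv` field exists, entries can be grouped or filtered by syllable shape with the same path-based lookups used for any other field.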



### Validating a Toolbox Lexicon

Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way. Instead of manually inspecting thousands of lexical entries, we can identify frequent field sequences with the help of a `Counter`:

In [33]:
from collections import Counter
field_sequences = Counter(':'.join(field.tag for field in entry) for entry in lexicon)

In [36]:
field_sequences.most_common()[:5]

[('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41),
 ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
 ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27),
 ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
 ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe', 17)]

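Now we can use a grammar to validate entries as we iterate over them. The validator below prints `+` for entries the grammar accepts and `-` for those it rejects. Note that `Sem_Field` has no empty alternative, so this grammar only accepts entries in which `sf` directly follows `dt`.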

In [38]:
grammar = nltk.CFG.fromstring('''
  S -> Head PS Glosses Comment Date Sem_Field Examples
  Head -> Lexeme Root
  Lexeme -> "lx"
  Root -> "rt" |
  PS -> "ps"
  Glosses -> Gloss Glosses |
  Gloss -> "ge" | "tkp" | "eng"
  Date -> "dt"
  Sem_Field -> "sf"
  Examples -> Example Ex_Pidgin Ex_English Examples |
  Example -> "ex"
  Ex_Pidgin -> "xp"
  Ex_English -> "xe"
  Comment -> "cmt" | "nt" |
  ''')

In [39]:
def validate_lexicon(grammar, lexicon, ignored_tags):
    rd_parser = nltk.RecursiveDescentParser(grammar)
    for entry in lexicon:
        marker_list = [field.tag for field in entry if field.tag not in ignored_tags]
        if list(rd_parser.parse(marker_list)):
            print("+", ':'.join(marker_list))
        else:
            print("-", ':'.join(marker_list))

In [40]:
lexicon = toolbox.xml('rotokas.dic')[10:20]
ignored_tags = ['arg', 'dcsv', 'pt', 'vx']

In [41]:
validate_lexicon(grammar, lexicon, ignored_tags)

- lx:ps:ge:tkp:sf:nt:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:sf:dt
- lx:ps:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:ge:ge:tkp:cmt:dt:ex:xp:xe
- lx:rt:ps:ge:ge:tkp:dt
- lx:rt:ps:ge:eng:eng:eng:ge:tkp:tkp:dt:cmt:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:dt:ex:xp:xe
- lx:ps:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe
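Every entry in the output above is rejected, because the grammar requires an `sf` field directly after `dt` and none of these ten entries has that sequence. Since the field order described by the grammar is regular, it can also be written as a regular expression, which makes it easy to experiment with alternatives. The sketch below is a swapped-in technique, not the parser used above: `STRICT` is intended to mirror the grammar as written, while `RELAXED` is a hypothetical variant in which `sf` is optional.

```python
import re

# Marker sequences copied from the validation output above.
entries = [
    'lx:ps:ge:tkp:sf:nt:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe',
    'lx:rt:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe',
    'lx:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe',
    'lx:ps:ge:tkp:nt:sf:dt',
    'lx:ps:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe',
    'lx:ps:ge:ge:ge:tkp:cmt:dt:ex:xp:xe',
    'lx:rt:ps:ge:ge:tkp:dt',
    'lx:rt:ps:ge:eng:eng:eng:ge:tkp:tkp:dt:cmt:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe',
    'lx:rt:ps:ge:tkp:dt:ex:xp:xe',
    'lx:ps:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe',
]

# Head PS Glosses Comment Date Sem_Field Examples, with sf mandatory.
STRICT = re.compile(r'lx(:rt)?:ps(:(ge|tkp|eng))*(:(cmt|nt))?:dt:sf(:ex:xp:xe)*')
# Hypothetical relaxation: the semantic field may be absent.
RELAXED = re.compile(r'lx(:rt)?:ps(:(ge|tkp|eng))*(:(cmt|nt))?:dt(:sf)?(:ex:xp:xe)*')

for entry in entries:
    print('+' if RELAXED.fullmatch(entry) else '-', entry)
```

Entries that still fail under the relaxed pattern are the ones whose fields appear in an unexpected order, for example a comment placed after the date.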


Further Reading: https://www.nltk.org/book/ch11.html#ref-ignored-tags