Like Chapter 10, this chapter is heavy on text and ideas. It focuses on managing linguistic data, that is, corpora.

While this notebook includes code cells for examples of data format conversions, data extraction, and other tasks involved in managing linguistic data, it is recommended to go through the chapter in the book to gain a better understanding of the major ideas involved.

https://www.nltk.org/book/ch11.html

# Managing Linguistic Data

## Corpus Structure: A Case Study

NLTK includes a sample from the TIMIT corpus. You can access its documentation in the usual way, using help(nltk.corpus.timit). Print nltk.corpus.timit.fileids() to see a list of the 160 recorded utterances in the corpus sample.

In [1]:
import nltk

In [3]:
nltk.corpus.timit.fileids()[:10]

['dr1-fvmh0/sa1.phn',
 'dr1-fvmh0/sa1.txt',
 'dr1-fvmh0/sa1.wav',
 'dr1-fvmh0/sa1.wrd',
 'dr1-fvmh0/sa2.phn',
 'dr1-fvmh0/sa2.txt',
 'dr1-fvmh0/sa2.wav',
 'dr1-fvmh0/sa2.wrd',
 'dr1-fvmh0/si1466.phn',
 'dr1-fvmh0/si1466.txt']
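
Each file identifier has internal structure: a dialect region, the speaker's sex and ID, a sentence type and number, and a file extension. The following is a minimal sketch of splitting an identifier into those parts with a regular expression. The `parse_fileid` helper and its field names are hypothetical and not part of NLTK.

```python
import re

# Hypothetical pattern for TIMIT sample identifiers such as 'dr1-fvmh0/sa1.phn'.
FILEID_PATTERN = re.compile(
    r"dr(?P<dialect_region>\d+)-"                          # dialect region number
    r"(?P<sex>[fm])(?P<speaker>\w+)/"                      # speaker sex, then speaker ID
    r"(?P<sentence_type>s[aix])(?P<sentence_number>\d+)"   # sa, si or sx plus a number
    r"\.(?P<extension>\w+)$"                               # phn, txt, wav or wrd
)

def parse_fileid(fileid):
    """Return the parts of a file identifier as a dict, or None if it does not match."""
    match = FILEID_PATTERN.match(fileid)
    return match.groupdict() if match else None

print(parse_fileid('dr1-fvmh0/si1466.txt'))
```

Grouping identifiers by `speaker` or `extension` in this way makes it easy to select, for example, only the phonetic transcriptions of one speaker.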

Each item has a phonetic transcription which can be accessed using the phones() method.

In [7]:
phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
print(phonetic)

['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'y', 'ix', 'dcl', 'd', 'aa', 'kcl', 's', 'ux', 'tcl', 'en', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax', 'q', 'ao', 'l', 'y', 'ih', 'ax', 'h#']


In [8]:
nltk.corpus.timit.word_times('dr1-fvmh0/sa1')

[('she', 7812, 10610),
 ('had', 10610, 14496),
 ('your', 14496, 15791),
 ('dark', 15791, 20720),
 ('suit', 20720, 25647),
 ('in', 25647, 26906),
 ('greasy', 26906, 32668),
 ('wash', 32668, 37890),
 ('water', 38531, 42417),
 ('all', 43091, 46052),
 ('year', 46052, 50522)]
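
Each tuple above pairs a word with its start and end offsets in the recording. As a rough sketch, the offsets can be turned into durations and used to spot pauses between words. The list is copied from the output above so the example runs without the corpus, and the 16 kHz sample rate is an assumption about the audio files, not something the output itself states.

```python
SAMPLE_RATE = 16000  # assumption: offsets are sample indices at 16 kHz

# Copied from the word_times() output above.
word_times = [('she', 7812, 10610), ('had', 10610, 14496), ('your', 14496, 15791),
              ('dark', 15791, 20720), ('suit', 20720, 25647), ('in', 25647, 26906),
              ('greasy', 26906, 32668), ('wash', 32668, 37890), ('water', 38531, 42417),
              ('all', 43091, 46052), ('year', 46052, 50522)]

# Duration of each word in seconds.
durations = {word: (end - start) / SAMPLE_RATE for word, start, end in word_times}

# Gaps (in samples) where one word ends before the next begins.
pauses = [(prev[0], nxt[0], nxt[1] - prev[2])
          for prev, nxt in zip(word_times, word_times[1:])
          if nxt[1] > prev[2]]

print(durations['greasy'])
print(pauses)
```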

In addition to this text data, TIMIT includes a lexicon that provides the canonical pronunciation of every word, which can be compared with a particular utterance:

In [9]:
timitdict = nltk.corpus.timit.transcription_dict()
timitdict['greasy'] + timitdict['wash'] + timitdict['water']

['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']

In [10]:
phonetic[17:30]

['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']
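
The two lists differ in several places. One way to locate the differences is to align them with the standard library's `difflib.SequenceMatcher`. This is a sketch only: both lists are copied from the outputs above, and removing the trailing stress digits from the lexicon entries (the hypothetical `strip_stress` helper) is an assumption made so that the two notations can be compared.

```python
import re
from difflib import SequenceMatcher

# Copied from the outputs above.
canonical = ['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']
observed = ['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

def strip_stress(phone):
    """Remove a trailing stress digit, e.g. 'iy1' becomes 'iy'."""
    return re.sub(r'\d$', '', phone)

canonical_plain = [strip_stress(p) for p in canonical]

# Align the two phone sequences and keep only the spans that differ.
matcher = SequenceMatcher(a=canonical_plain, b=observed, autojunk=False)
differences = [(tag, canonical_plain[i1:i2], observed[j1:j2])
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != 'equal']

for tag, expected, actual in differences:
    print(tag, expected, actual)
```

The alignment shows the vowel substitutions, the inserted `epi` segment, and the differing realization of the final consonant and vowel of "water".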

In [11]:
# Access the demographic data of a particular speaker
nltk.corpus.timit.spkrinfo('dr1-fvmh0')

SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')

## Acquiring Data

### Obtaining Data from Word Processor Files
  
Consider the following fragment of a lexical entry: "sleep [sli:p] v.i. a condition of body and mind ...". We can enter this in MS Word, choose "Save as Web Page", and then inspect the resulting HTML file:

```
<p class=MsoNormal>sleep
  <span style='mso-spacerun:yes'> </span>
  [<span class=SpellE>sli:p</span>]
  <span style='mso-spacerun:yes'> </span>
  <b><span style='font-size:11.0pt'>v.i.</span></b>
  <span style='mso-spacerun:yes'> </span>
  <i>a condition of body and mind ...<o:p></o:p></i>
</p>
```

In [14]:
import re

In [17]:
legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])  # part-of-speech tags the lexicon allows
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")  # captures the tag inside the 11pt span
with open("data/dict.htm", encoding="windows-1252") as f:
    document = f.read()
document

"<p class=MsoNormal>sleep\n  <span style='mso-spacerun:yes'> </span>\n  [<span class=SpellE>sli:p</span>]\n  <span style='mso-spacerun:yes'> </span>\n  <b><span style='font-size:11.0pt'>v.i.</span></b>\n  <span style='mso-spacerun:yes'> </span>\n  <i>a condition of body and mind ...<o:p></o:p></i>\n</p>"

We can list the part-of-speech tags that the document actually uses as follows:

In [18]:
used_pos = set(re.findall(pattern, document))

In [19]:
used_pos

{'v.i.'}
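
The set of permitted tags can then be compared with the tags actually used. Below is a minimal sketch of that check. The `document` string is a stand-in for the contents of the HTML file, and its second entry, tagged `intrans`, is a hypothetical malformed entry added for illustration.

```python
import re

legal_pos = {'n', 'v.t.', 'v.i.', 'adj', 'det'}
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")

# Stand-in for the HTML file; the second line is a hypothetical bad entry.
document = (
    "<b><span style='font-size:11.0pt'>v.i.</span></b>\n"
    "<b><span style='font-size:11.0pt'>intrans</span></b>\n"
)

used_pos = set(pattern.findall(document))
illegal_pos = used_pos - legal_pos  # tags that are used but not permitted

print(sorted(illegal_pos))  # ['intrans']
```

An empty result would indicate that every tag in the document comes from the permitted set.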

Once we know the data is correctly formatted, we can write other programs to convert the data into a different format.

### Obtaining Data from Spreadsheets and Databases

Our program might perform a linguistically motivated query which cannot be expressed in SQL, e.g. select all words that appear in example sentences for which no dictionary entry is provided. For this task, we would need to extract enough information from a record for it to be uniquely identified, along with the headwords and example sentences. Let's suppose this information was now available in a CSV file dict.csv:

```
"sleep","sli:p","v.i","a condition of body and mind ..."
"walk","wo:k","v.intr","progress by lifting and setting down each foot ..."
"wake","weik","intrans","cease to sleep"
```

We can express this query in code as follows:

In [20]:
import csv

In [28]:
lexicon = csv.reader(open('data/dict.csv'))

In [29]:
pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]

In [33]:
lexemes, defns = zip(*pairs)
defn_words = set(w for defn in defns for w in defn.split())
print(defn_words)

{'each', 'down', 'mind', 'and', 'body', 'setting', 'cease', 'lifting', 'condition', 'foot', 'by', '...', 'to', 'progress', 'sleep', 'a', 'of'}


In [35]:
print(sorted(defn_words.difference(lexemes)))

['...', 'a', 'and', 'body', 'by', 'cease', 'condition', 'down', 'each', 'foot', 'lifting', 'mind', 'of', 'progress', 'setting', 'to']


We can use this information to enrich the lexicon.

### Converting Data Formats

Suppose we need to build an inverted index from some input data.

Continuing with the data from the previous example, we construct an index that maps each word of a dictionary definition (longer than three characters) to the lexemes whose definitions contain it.

In [36]:
idx = nltk.Index((defn_word, lexeme) 
                 for (lexeme, defn) in pairs 
                 for defn_word in nltk.word_tokenize(defn) 
                 if len(defn_word) > 3)

In [37]:
idx

Index(list,
      {'condition': ['sleep'],
       'body': ['sleep'],
       'mind': ['sleep'],
       'progress': ['walk'],
       'lifting': ['walk'],
       'setting': ['walk'],
       'down': ['walk'],
       'each': ['walk'],
       'foot': ['walk'],
       'cease': ['wake'],
       'sleep': ['wake']})

In [40]:
with open("output/dict.idx", "w") as idx_file:
    for word in sorted(idx):
        idx_words = ', '.join(idx[word])
        idx_line = "{}: {}".format(word, idx_words)
        print(idx_line, file=idx_file)
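
To see what the index file contains without NLTK or the data files, here is a self-contained sketch of the same conversion. It makes three assumptions: a `collections.defaultdict` stands in for `nltk.Index`, `str.split()` stands in for `nltk.word_tokenize`, and the output goes to an in-memory buffer rather than to a file on disk.

```python
import io
from collections import defaultdict

# Lexeme/definition pairs taken from the CSV sample above.
pairs = [
    ("sleep", "a condition of body and mind ..."),
    ("walk", "progress by lifting and setting down each foot ..."),
    ("wake", "cease to sleep"),
]

# Inverted index: definition word -> lexemes whose definition contains it.
idx = defaultdict(list)
for lexeme, defn in pairs:
    for defn_word in defn.split():   # assumption: whitespace tokenization
        if len(defn_word) > 3:
            idx[defn_word].append(lexeme)

# Write one "word: lexeme, lexeme" line per entry, in sorted order.
buffer = io.StringIO()               # stand-in for the output file
for word in sorted(idx):
    print("{}: {}".format(word, ", ".join(idx[word])), file=buffer)

lines = buffer.getvalue().splitlines()
print(lines[:3])  # ['body: sleep', 'cease: wake', 'condition: sleep']
```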

### Lookup by Pronunciation Similarity

First, we identify confusible letter sequences and map complex versions to simpler versions.

In [41]:
mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
             ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

In [42]:
def signature(word):
    for patt, repl in mappings: 
        word = re.sub(patt, repl, word)
    pieces = re.findall("[^aeiou]+", word)
    return "".join(char for piece in pieces for char in sorted(piece))[:8]

In [43]:
signature("illefent")

'lfnt'

In [44]:
signature("elephant")

'lfnt'

In [45]:
signature("ebsekwieous")

'bskws'

In [46]:
signature("nuculerr")

'nclr'
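
To see how each rewrite rule contributes to a signature, the intermediate forms can be recorded along the way. The `trace_signature` helper below is a hypothetical variant of `signature` written for illustration. It applies the same mappings and returns the list of forms the word passes through, ending with the final signature.

```python
import re

mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
            ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

def trace_signature(word):
    """Return the successive forms of word, ending with its signature."""
    steps = [word]
    for patt, repl in mappings:
        rewritten = re.sub(patt, repl, word)
        if rewritten != word:        # record only rules that changed something
            steps.append(rewritten)
        word = rewritten
    # Drop vowels, then sort the letters within each consonant cluster.
    pieces = re.findall("[^aeiou]+", word)
    steps.append("".join(char for piece in pieces for char in sorted(piece))[:8])
    return steps

print(trace_signature("elephant"))  # ['elephant', 'elefant', 'alafant', 'lfnt']
print(trace_signature("knight"))    # ['knight', 'knit', 'nit', 'nat', 'nt']
```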

Next, we create a mapping from signatures to words, for all words in our lexicon.

In [47]:
signatures = nltk.Index((signature(w), w) for w in nltk.corpus.words.words())

In [49]:
print(signatures[signature("nuculerr")])

['anicular', 'inocular', 'nucellar', 'nuclear', 'unicolor', 'uniocular', 'unocular']


Finally, we rank the results in terms of similarity to the original word.

In [51]:
def rank(word, wordlist):
    ranked = sorted((nltk.edit_distance(word, w), w) for w in wordlist)
    return [word for (_, word) in ranked]

In [52]:
def fuzzy_spell(word):
    sig = signature(word)
    if sig in signatures:
        return rank(word, signatures[sig])
    return []

In [None]:
fuzzy_spell("illefent")
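
The ranking step can also be tried without downloading the words corpus. The sketch below makes two assumptions: the candidate list is the one printed earlier for the signature `'nclr'`, and a hand-written `levenshtein` function stands in for `nltk.edit_distance`.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def rank(word, wordlist):
    """Order wordlist by edit distance to word, breaking ties alphabetically."""
    ranked = sorted((levenshtein(word, w), w) for w in wordlist)
    return [w for (_, w) in ranked]

# Candidates sharing the signature 'nclr', copied from the output above.
candidates = ['anicular', 'inocular', 'nucellar', 'nuclear',
              'unicolor', 'uniocular', 'unocular']

ranked = rank("nuculerr", candidates)
print(ranked[0])  # nuclear
```

Because the signature lookup has already narrowed the search to a handful of candidates, the comparatively expensive edit distance only needs to be computed for those few words.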