# 3: Lexicons

* Word "Lists" (Sets)
* Simple dictionary lexicons
* Complex lexicons
* NLTK lexicons
* Exercises

## Word "Lists" (Sets)

The simplest lexicon is just a list of words that have something in common. For example:

* Pronouns ("he","she", "I", ...)
* Negative words ("terrible","jerk","foolishly",...)
* A list of all family relations ("father", "sister",...)
* The vocabulary of the Brown corpus

Typically, one builds a lexicon in order to identify instances of that word class in some corpus.

Rule #1: Don't use lists for lexicons. Use [sets](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)!

In [1]:
some_words = ["the","quick","brown", "fox","jumped", "over","the","lazy","dog"]

lexicon = set(some_words)
lexicon
# lexicon[3]  if you run this, you will get an error saying 
# TypeError: 'set' object is not subscriptable

{'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the'}

Two big reasons to use sets for lexicons:

* Elements are unique, which is exactly what you need for lexicons
* As we know, checking for membership (*in*) is much faster: $O(1)$ on average for a set vs. $O(n)$ for a list. This especially matters when the lexicon is big.
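To see the speed difference in practice, here is a rough timing sketch. The vocabulary is synthetic (the names `words_list` and `words_set` and the made-up `"w0"`, `"w1"`, ... strings are assumptions for illustration), and the exact timings will vary from machine to machine.

```python
import timeit

# Synthetic vocabulary: 100,000 made-up "words" (an assumption for illustration)
words_list = [f"w{i}" for i in range(100_000)]
words_set = set(words_list)

target = "w99999"  # worst case for the list: the very last element

# Time 100 membership checks against each container
list_time = timeit.timeit(lambda: target in words_list, number=100)
set_time = timeit.timeit(lambda: target in words_set, number=100)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

On most machines the set lookup should come out several orders of magnitude faster, and the gap grows with the size of the lexicon.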

There are a few ways to create sets. 
- If you are starting totally from scratch, you have to use the built-in *set()* function
    - curly brackets alone refer to dicts, not sets! 
- If you are starting with a (small) fixed set of existing items you can declare using curly brackets instead of converting a list. 
- There are also set comprehensions, also using curly brackets!


In [2]:
seasons = {"spring", "summer", "fall"}
months = {}  # careful: this creates an empty dict, not an empty set
days = {n for n in range(31)}

In [3]:
seasons.add("winter")
seasons

{'fall', 'spring', 'summer', 'winter'}

In [4]:
days.add(31)
print(days)

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}


In [5]:
months = set()
months.add("January")
months

{'January'}

It is worth memorizing the basic set methods for adding and removing items, including `add`, `update`, and `discard`.

In [6]:
planets = {"Mars","Venus", "Mercury"}
more_planets = ["Jupiter","Neptune","Uranus", "Pluto"]

In [7]:
planets.add("Earth")
planets

{'Earth', 'Mars', 'Mercury', 'Venus'}

In [8]:
planets.update(more_planets)
planets

{'Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune', 'Pluto', 'Uranus', 'Venus'}

In [9]:
planets.discard("Pluto")
planets
planets.discard("Ceres")  # discard does nothing (and raises no error) if the item is absent
planets

{'Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune', 'Uranus', 'Venus'}

Sets are great for quickly finding intersections using `&` and differences using `-`. These are WAY faster than alternatives involving looping. Let's find the words that appear in both the Brown and Treebank corpora, and the words that appear in only one or the other.

In [10]:
from nltk.corpus import brown, treebank

brown_vocab = set(brown.words())
print('brown vocab: ', len(brown_vocab))
treebank_vocab = set(treebank.words())
print('treebank vocab: ', len(treebank_vocab))

brown vocab:  56057
treebank vocab:  12408


In [11]:
len(brown_vocab - treebank_vocab)

47510

In [12]:
len(treebank_vocab - brown_vocab)

3861

In [13]:
len(treebank_vocab & brown_vocab)

8547
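Two other set operators are worth knowing: `|` for union and `^` for symmetric difference (items in exactly one of the two sets). A minimal sketch on toy vocabularies; the sets `vocab_a` and `vocab_b` are made up for illustration and are not taken from the corpora:

```python
# Toy vocabularies (made up for illustration)
vocab_a = {"the", "cat", "sat"}
vocab_b = {"the", "dog", "sat", "down"}

both = vocab_a & vocab_b             # in both sets
only_a = vocab_a - vocab_b           # in vocab_a but not in vocab_b
either_not_both = vocab_a ^ vocab_b  # in exactly one of the two sets
everything = vocab_a | vocab_b       # union of the two sets

# The symmetric difference equals the union of the two one-sided differences
print(either_not_both == ((vocab_a - vocab_b) | (vocab_b - vocab_a)))
```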

You might sometimes want to intersect a set with something that isn't a set (like a list, or even a string). If so, and it doesn't make sense to convert the other object to a set, you can use set methods such as `intersection` instead of operators (which only work when both operands are sets).

In [14]:
vowels = {'a','e','i','o','u','y'}
word1 = "rstln"
word2 = "rstlne"

In [15]:
# len(vowels & word1) 
# If you run this, it will throw an error saying 
# TypeError: unsupported operand type(s) for &: 'set' and 'str'

In [16]:
vowels.intersection(word1)

set()

In [17]:
vowels.intersection(word2)

{'e'}

In [18]:
# word1.intersection(word2)
# This is also wrong because 'str' object has no attribute 'intersection'

Some set drawbacks:

* No order, and no guarantee that order will be preserved when the set changes
* Can't add mutable objects like lists and dicts (use tuples!)

In [19]:
months = ["Jan", "Feb", "Mar"]

list(set(months))

['Feb', 'Mar', 'Jan']
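The second drawback can be sketched as follows. The set name `pairs` and the word pairs are made up for illustration. The point is that set elements must be hashable, which tuples and frozensets are and lists are not.

```python
pairs = set()

# Tuples are immutable (hashable), so they can be set elements
pairs.add(("lazy", "dog"))
pairs.add(("lazy", "dog"))  # duplicate, so the set does not grow

# Lists are mutable, so trying to add one raises a TypeError
try:
    pairs.add(["quick", "fox"])
except TypeError as err:
    print("TypeError:", err)

# frozenset is the immutable counterpart of set
pairs.add(frozenset({"brown", "fox"}))

print(len(pairs))
```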

## Simple dictionary lexicons

Most common lexical need in computational linguistics: word counts

Easy way to build a dictionary of counts when you've got a list (or other iterable) of words: [Counters](https://docs.python.org/3/library/collections.html#collections.Counter)

In [20]:
from collections import Counter

counts = Counter(brown.words())
counts["brown"]


65

In [21]:
counts["the"]

62713

You can `update` counters to add more items

In [22]:
counts.update(treebank.words())
counts["the"]

66758
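Counters have a couple of other conveniences, sketched here on a toy token list (the `tokens` list is made up for illustration): `most_common` returns the highest counts, and looking up a word that was never seen returns zero instead of raising a `KeyError`.

```python
from collections import Counter

# Toy token list (made up for illustration)
tokens = ["the", "dog", "saw", "the", "cat", "and", "the", "dog"]
toy_counts = Counter(tokens)

# The two most frequent words, as (word, count) pairs
print(toy_counts.most_common(2))

# A missing word counts as zero, and the lookup does not add the key
print(toy_counts["unicorn"])
```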

Counters are not always practical, for instance when what you are counting isn't an iterable or when you want to do some operation before counting. A normal Python dict is fine, but looking up a missing key raises a `KeyError`, so each count needs to start from zero. One solution: the [get](https://docs.python.org/3/library/stdtypes.html#dict.get) method, which returns a default value when the key is absent.

In [23]:
counts = {}
for word in brown.words():
    word = word.lower()
    counts[word] = counts.get(word, 0) + 1
     
    
counts["the"]


69971

A quick exercise: let's count the first words of sentences in the Brown corpus using `get`.

In [24]:
first_word_count = {}
for sent in brown.sents():
    word = sent[0]
    first_word_count[word] = first_word_count.get(word, 0) + 1
    
print(first_word_count["And"])

789


Yet another solution to the initialization problem for counting (and other situations) is the [defaultdict](https://docs.python.org/3/library/collections.html#collections.defaultdict). When you initialize a defaultdict with `int`, the default value is zero, so there is no need to do anything but add! Defaultdicts also work with `list`, `set`, and `dict`, creating an empty instance the first time you access a missing key.

In [25]:
from collections import defaultdict
from nltk.corpus import brown

counts = defaultdict(int)

for word in brown.words():
    counts[word] = counts[word] + 1
    
print(counts["the"])

62713
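The same idea with a non-integer default can be sketched like this, grouping words by their first letter with `defaultdict(set)`. The word list and the name `by_first_letter` are made up for illustration.

```python
from collections import defaultdict

# Toy word list (made up for illustration)
words = ["bear", "back", "slug", "good", "bore", "slug"]

by_first_letter = defaultdict(set)
for word in words:
    # Accessing a missing key creates an empty set, so we can add right away
    by_first_letter[word[0]].add(word)

print(dict(by_first_letter))
```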


Another very common need: assign an index (an integer) to each word, for looking up in data structures like matrices.

In [26]:
index_dict = {}

for word in counts:
    index_dict[word] = len(index_dict)
    
index_dict["index"]

10449

In these cases, you will often need a reversed index as well. The `items` method for dictionaries is useful if you are iterating over an existing dictionary and want to access both key and value at the same time. Let's build a reverse index dict in one line using a dictionary comprehension!

In [27]:
rev_dict = {value:key for key, value in index_dict.items()}
rev_dict[10449]

'index'
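As a small illustration of why such indices are useful, the sketch below builds an index with `enumerate` and uses it to fill a count vector for one sentence. The vocabulary, sentence, and variable names are all made up for illustration. Sorting the vocabulary first is a design choice that makes the indices reproducible across runs.

```python
# Toy vocabulary (made up for illustration); sorting fixes the index order
vocab = sorted({"the", "lazy", "dog", "fox"})

word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = vocab  # a list already maps index -> word

# Use the indices to fill a count vector for one toy sentence
sentence = ["the", "dog", "the", "fox"]
vector = [0] * len(vocab)
for word in sentence:
    vector[word_to_index[word]] += 1

print(word_to_index)
print(vector)
```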

## Complex lexicons


Sometimes lexicons are more complex and are represented as nested Python data structures. For example, instead of a single count, each word might map to a list of word senses, each of which is a dictionary of properties (including part of speech and the count of that word sense in a corpus).

In [28]:
mini_sense_lexicon = {"bear":[{"POS":"noun","animate":True,"count":634,"gloss":"A big furry animal"},
                              {"POS":"verb","transitive":True,"count":294, "past tense":"bore", "past participle":"borne", "gloss":"to endure"}],
                      "slug":[{"POS":"noun","animate":True,"count":34, "gloss":"A slimy animal"},
                              {"POS":"verb","transitive":True,"count":3, "gloss": "to hit"}],
                      "back":[{"POS":"noun","animate":False,"count":12,"gloss":"a body part"},
                              {"POS":"noun","animate":False,"count":43, "gloss":"the rear of a place"},
                              {"POS":"verb","transitive":True,"count":5, "gloss":"to support"},
                              {"POS":"adverb","count":47,"gloss":"in a returning fashion"}],
                      "good":[{"POS":"noun","animate":False,"count":19,"gloss":"a thing of value"},
                              {"POS":"adjective", "count":1293,"gloss":"positive"}]}

These can be tricky to navigate. Let's answer the following questions by accessing the information in the data structure:

How many senses does "back" have in this lexicon?

In [29]:
len(mini_sense_lexicon["back"])

4

Does "slug" have an adjectival sense?

In [30]:
any(features["POS"] == "adjective" for features in mini_sense_lexicon["slug"])

False

Which words have a verb sense with an irregular past tense?

In [31]:
for word in mini_sense_lexicon:
    for feature_dict in mini_sense_lexicon[word]:
        if ("past tense" in feature_dict):
            print(word)

bear


What is the gloss of the most common sense of "back"?

In [32]:
highest_val = 0
gloss = ""
for feature_dict in mini_sense_lexicon["back"]:
    if(feature_dict["count"] > highest_val):
        highest_val = feature_dict["count"]
        gloss = feature_dict["gloss"]

print("Highest: ", highest_val)
print("Gloss: ", gloss)

Highest:  47
Gloss:  in a returning fashion
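A more compact alternative is the built-in `max` with a `key` function, which returns the whole sense dictionary rather than just the count. A minimal sketch, using a short `senses` list in the same format as the lexicon entries (the list and variable names are for illustration only):

```python
# Two senses in the same format as the entries of mini_sense_lexicon
senses = [
    {"POS": "noun", "count": 12, "gloss": "a body part"},
    {"POS": "adverb", "count": 47, "gloss": "in a returning fashion"},
]

# max() with a key function returns the dictionary with the largest count
top_sense = max(senses, key=lambda sense: sense["count"])
print(top_sense["gloss"])
```

If two senses tie on count, `max` returns the first one it encounters.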


## NLTK lexicons


NLTK has a lot of useful lexicons. Most are word lists. They are listed under corpora and have the same interface as other corpora. Note that the word lists are not sets, so convert them to sets if you want to use them for lookup.

#### Stopwords

Lists of closed-class/function words in various languages



In [33]:
from nltk.corpus import stopwords

In [34]:
stopwords.words("english")[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [35]:
stopwords.words("french")[:10]

['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle']
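A typical use is filtering stopwords out of a token list. In the sketch below, a short hand-written list stands in for `stopwords.words("english")` (an assumption, so that the example runs without the NLTK data). With NLTK available you would build the set from the real list instead.

```python
# A short hand-written stand-in for stopwords.words("english")
# (an assumption, so the example runs without downloading NLTK data)
stopword_list = ["i", "me", "my", "we", "the", "a", "over"]
stopword_set = set(stopword_list)  # convert once, then reuse for lookups

tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# Lowercase before the lookup, since the stopword entries are lowercase
content_words = [tok for tok in tokens if tok.lower() not in stopword_set]
print(content_words)
```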

#### Swadesh

Words for roughly 200 common concepts in a range of languages, used in historical linguistics

In [36]:
from nltk.corpus import swadesh

In [37]:
print(swadesh.fileids())

['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']


In [38]:
for i in range(10):
    print(swadesh.words("en")[i], swadesh.words("de")[i]) 

I ich
you (singular), thou du, Sie
he er
we wir
you (plural) ihr, Sie
they sie
this dieses
that jenes
here hier
there dort


#### Names

Lists of mostly English names, divided by gender

In [39]:
from nltk.corpus import names

In [40]:
names.fileids()

['female.txt', 'male.txt']

In [41]:
names.words("female.txt")[-10:]

['Zonnya',
 'Zora',
 'Zorah',
 'Zorana',
 'Zorina',
 'Zorine',
 'Zsa Zsa',
 'Zsazsa',
 'Zulema',
 'Zuzana']

#### Opinion Lexicon

Positive and negative word lists

In [42]:
from nltk.corpus import opinion_lexicon

In [43]:
opinion_lexicon.fileids()

['negative-words.txt', 'positive-words.txt']

In [44]:
opinion_lexicon.words('negative-words.txt')[:10]

['2-faced',
 '2-faces',
 'abnormal',
 'abolish',
 'abominable',
 'abominably',
 'abominate',
 'abomination',
 'abort',
 'aborted']

#### CMU Pronouncing Dictionary

* A list of pronunciations for each English word string
* Pronunciations are a list of ARPAbet phones
* ARPAbet phones are alphabetic strings, except that each vowel ends in a digit
* The digit indicates stress (0 = no stress, 1 = primary, 2 = secondary)

Let's look at a few entries:

In [45]:
from nltk.corpus import cmudict
p_dict = cmudict.dict()

In [46]:
p_dict["there"]

[['DH', 'EH1', 'R']]

In [47]:
p_dict["their"]

[['DH', 'EH1', 'R']]

In [48]:
p_dict["read"]

[['R', 'EH1', 'D'], ['R', 'IY1', 'D']]
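Because only vowels carry a stress digit, counting the phones that end in a digit gives a rough syllable count. A minimal sketch, using pronunciations copied from the outputs shown above so that it runs without `cmudict`. The helper `count_syllables` is a hypothetical name, not part of NLTK.

```python
# Pronunciations copied from the cmudict outputs shown above
pronunciations = {
    "there": [["DH", "EH1", "R"]],
    "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
}

def count_syllables(pron):
    """Count vowel phones, i.e. phones whose last character is a stress digit."""
    return sum(1 for phone in pron if phone[-1].isdigit())

# One syllable count per listed pronunciation of each word
for word, prons in pronunciations.items():
    print(word, [count_syllables(pron) for pron in prons])
```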

Let's get some basic stats for this lexicon: total entries, average number of pronunciations per word, and average number of phones per pronunciation.

In [49]:
pro_count = 0
pho_count = 0
for pronuns in p_dict.values():
    pro_count += len(pronuns)
    for pronun in pronuns:
        pho_count += len(pronun)

In [50]:
len(p_dict)

123455

In [51]:
pro_per_word = pro_count / len(p_dict)
pro_per_word

1.0832854076384109

In [52]:
pho_count / (len(p_dict) * pro_per_word)

6.3850542482633825

## Exercises

1. Create a set of female names that appear in the Brown corpus.

In [53]:
types = set(brown.words())
female_names = set(names.words("female.txt"))

female_brown = types & female_names
female_brown
print(len(female_brown))
print(len(female_names))


702
5001


2. (Programmatically) create a set of the words from `mini_sense_lexicon` which have an animate noun sense.

In [54]:
animates = set()
for word in mini_sense_lexicon:
    for feature_dict in mini_sense_lexicon[word]:
        # require a noun sense that is also marked animate
        if (feature_dict["POS"] == "noun" and feature_dict.get("animate")):
            animates.add(word)
            
animates
    
    

{'bear', 'slug'}

3. Count how often each English phone appears in the CMU lexicon, stripping off the stress markers at the end of vowels.

In [55]:
phones = defaultdict(int)
for word in p_dict:
    for pron in p_dict[word]:
        for phone in pron:
            if not phone[-1].isdigit():
                phones[phone] = phones[phone] + 1
            else:
                phones[phone[:-1]] = phones[phone[:-1]] + 1
                
print(phones)

defaultdict(<class 'int'>, {'AH': 71410, 'EY': 13521, 'F': 13748, 'AO': 11290, 'R': 46046, 'T': 48549, 'UW': 9736, 'W': 8864, 'N': 60564, 'IH': 50093, 'P': 19715, 'L': 49479, 'AA': 24546, 'B': 21057, 'ER': 29027, 'G': 13553, 'K': 42502, 'S': 50427, 'EH': 27398, 'TH': 2902, 'M': 29347, 'D': 32389, 'V': 10742, 'Z': 27842, 'IY': 34504, 'AE': 21804, 'OW': 19047, 'NG': 9865, 'SH': 8700, 'HH': 9319, 'AW': 3408, 'AY': 11313, 'JH': 6404, 'Y': 5171, 'CH': 4960, 'ZH': 560, 'UH': 2273, 'DH': 576, 'OY': 1267})
