# 1. Lexical semantics

* What do words mean?
* WordNet
* Word Sense Disambiguation
* Content analysis

## What do words mean?

* Dictionary definitions?
* Relationship to other words?
* Lexical categories?
* Linguistic context in which they appear?
* Referents in the "real world"?
* Effects in the mind of a listener?

## WordNet

[WordNet](https://wordnet.princeton.edu/) is a lexical database that structures words primarily in terms of two key relationships: *synonymy* and *hypernymy*. Synonymy is when two words mean (approximately) the same thing, and WordNet represents synonymy by organizing words into groups of synonyms called *synsets*. Despite the name, the fundamental unit of WordNet is the synset, not the word.

Let's start by looking at the size of (English) WordNet in terms of different parts of speech. We can get all the synsets for a part of speech (POS) using the `all_synsets` method:

In [56]:
from nltk.corpus import wordnet as wn

pos_list = [wn.NOUN,wn.VERB,wn.ADJ,wn.ADV]

for pos in pos_list:
    print(pos)

    synsets = list(wn.all_synsets(pos))

    print(len(synsets))
    print(synsets[:10])


n
82115
[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('abstraction.n.06'), Synset('thing.n.12'), Synset('object.n.01'), Synset('whole.n.02'), Synset('congener.n.03'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('benthos.n.02')]
v
13767
[Synset('breathe.v.01'), Synset('respire.v.02'), Synset('respire.v.01'), Synset('choke.v.01'), Synset('hyperventilate.v.02'), Synset('hyperventilate.v.01'), Synset('aspirate.v.03'), Synset('burp.v.01'), Synset('force_out.v.08'), Synset('hiccup.v.01')]
a
18156
[Synset('able.a.01'), Synset('unable.a.01'), Synset('abaxial.a.01'), Synset('adaxial.a.01'), Synset('acroscopic.a.01'), Synset('basiscopic.a.01'), Synset('abducent.a.01'), Synset('adducent.a.01'), Synset('nascent.a.01'), Synset('emergent.s.02')]
r
3621
[Synset('a_cappella.r.01'), Synset('ad.r.01'), Synset('ce.r.01'), Synset('bc.r.01'), Synset('bce.r.01'), Synset('horseback.r.01'), Synset('barely.r.01'), Synset('just.r.06'), Synset('hardly.r.02'), Synset('anisotropic

We can see that WordNet is dominated by nouns; about 70% of the synsets counted above are nominal, indicated by "n". Each synset is represented by an identifier consisting of

1. the most prominent synonym
2. the part of speech
3. an id number (seemingly arbitrary) which distinguishes synsets for which the first two of these are the same.
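The three parts of an identifier can be pulled apart with plain string handling. The following is a minimal sketch: `parse_synset_id` is a hypothetical helper, not part of NLTK, and it assumes the identifier always ends in `.<pos>.<number>`.

```python
def parse_synset_id(identifier):
    """Split an identifier such as 'goal.n.01' into (headword, pos, number)."""
    # Split from the right so that a headword containing dots stays intact.
    headword, pos, number = identifier.rsplit(".", 2)
    return headword, pos, int(number)


print(parse_synset_id("goal.n.01"))
print(parse_synset_id("physical_entity.n.01"))
```

Splitting from the right is a deliberate choice here, since only the last two fields have a fixed shape.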

If we know this identifier, we can get the corresponding synset object, using `synset`. Let's start with the very top of the WordNet hierarchy: *entity*.

In [57]:
wn.synset("entity.n.01")

Synset('entity.n.01')

A more typical way to access synsets in WordNet, though, is to look up a word with `synsets`. Many words have more than one synset, which indicates that the word has more than one meaning and is therefore ambiguous. Let's look at the word *goal*.

In [58]:
goal_synsets = wn.synsets("goal")
goal_synsets

[Synset('goal.n.01'),
 Synset('finish.n.04'),
 Synset('goal.n.03'),
 Synset('goal.n.04')]

The synset identifier (which you can get as a string by using the `name()` method for synsets) often isn't terribly helpful for figuring out what a synset means. Fortunately, every synset has a *gloss*, or definition, associated with it, which can be accessed using the `definition()` method. Some synsets also have examples, accessible via the `examples()` method, though note that an example may or may not include the specific word form you're interested in.

In [59]:
for synset in goal_synsets:
    print(synset.name())
    print(synset.definition())
    print(synset.examples())


goal.n.01
the state of affairs that a plan is intended to achieve and that (when achieved) terminates behavior intended to achieve it
['the ends justify the means']
finish.n.04
the place designated as the end (as of a race or journey)
['a crowd assembled at the finish', 'he was nearly exhausted as their destination came into view']
goal.n.03
game equipment consisting of the place toward which players of a game try to advance a ball or puck in order to score points
[]
goal.n.04
a successful attempt at scoring
['the winning goal came with less than a minute left to play']


We can find the other words associated with a particular synset using the `lemmas()` method, which returns the corresponding list of `Lemma` objects:

In [60]:
goal_synsets[0].lemmas()


[Lemma('goal.n.01.goal'), Lemma('goal.n.01.end')]

In [61]:
goal_synsets[1].lemmas()

[Lemma('finish.n.04.finish'),
 Lemma('finish.n.04.destination'),
 Lemma('finish.n.04.goal')]

Lemmas have a few useful methods. You can access the words (which appear at the end of the 4-part lemma identifiers) using `name()`.

In [62]:
goal_synsets[0].lemmas()[0].name()

'goal'

Each lemma also has a `count` associated with it, which comes from a sense-annotated portion of the Brown corpus called *semcor*. Using these counts, we can guess at which sense of a word is more common.

> Note: The count belongs to one lemma within one synset. Below, it is the number of times the word "goal" was tagged with the sense `goal.n.01`, not the number of times "goal" occurs overall. The loop two cells down compares the counts across senses.

In [63]:
goal_synsets[0].lemmas()[0].count()

34

If we want to know which senses of a word are more common, we can iterate over its senses, find the corresponding lemma, and check its count:

In [64]:
for synset in wn.synsets("goal"):
    for lemma in synset.lemmas():
        if lemma.name() == "goal":
            print(synset.definition())
            print(lemma.count())

the state of affairs that a plan is intended to achieve and that (when achieved) terminates behavior intended to achieve it
34
the place designated as the end (as of a race or journey)
1
game equipment consisting of the place toward which players of a game try to advance a ball or puck in order to score points
0
a successful attempt at scoring
0


Note that we can access WordNet synsets using inflected forms, and this is often a good idea since inflected forms often partially disambiguate words.

In [65]:
for synset in wn.synsets("watched"):
    for lemma in synset.lemmas():
        print(lemma)

Lemma('watch.v.01.watch')
Lemma('watch.v.02.watch')
Lemma('watch.v.02.observe')
Lemma('watch.v.02.follow')
Lemma('watch.v.02.watch_over')
Lemma('watch.v.02.keep_an_eye_on')
Lemma('watch.v.03.watch')
Lemma('watch.v.03.view')
Lemma('watch.v.03.see')
Lemma('watch.v.03.catch')
Lemma('watch.v.03.take_in')
Lemma('watch.v.04.watch')
Lemma('watch.v.04.look_on')
Lemma('watch.v.05.watch')
Lemma('watch.v.05.look_out')
Lemma('watch.v.05.watch_out')
Lemma('watch.v.06.watch')
Lemma('determine.v.08.determine')
Lemma('determine.v.08.check')
Lemma('determine.v.08.find_out')
Lemma('determine.v.08.see')
Lemma('determine.v.08.ascertain')
Lemma('determine.v.08.watch')
Lemma('determine.v.08.learn')


Generally, WordNet synsets are very fine-grained. This is actually a problem, because it can be very difficult for both humans and computers to differentiate subtle sense distinctions. Though these distinctions may matter to lexicographers, for many computational applications it makes sense to collapse senses, or simply ignore rare ones.

In [66]:
for synset in wn.synsets("watch"):
    print(synset.name())
    print(synset.definition())

watch.n.01
a small portable timepiece
watch.n.02
a period of time (4 or 2 hours) during which some of a ship's crew are on duty
watch.n.03
a purposeful surveillance to guard or observe
watch.n.04
the period during which someone (especially a guard) is on duty
lookout.n.01
a person employed to keep watch for some anticipated event
vigil.n.02
the rite of staying awake for devotional purposes (especially on the eve of a religious festival)
watch.v.01
look attentively
watch.v.02
follow with the eyes or the mind
watch.v.03
see or watch
watch.v.04
observe with attention
watch.v.05
be vigilant, be on the lookout or be careful
watch.v.06
observe or determine by looking
determine.v.08
find out, learn, or determine with certainty, usually by making an inquiry or other effort


For example: Let's pick an ambiguous word and try to enumerate its senses. Then we'll see how many senses it has in WordNet.

In [67]:
for synset in wn.synsets("light"):
    print(synset.name())
    print(synset.definition())

light.n.01
(physics) electromagnetic radiation that can produce a visual sensation
light.n.02
any device serving as a source of illumination
light.n.03
a particular perspective or aspect of a situation
luminosity.n.01
the quality of being luminous; emitting or reflecting light
light.n.05
an illuminated area
light.n.06
a condition of spiritual awareness; divine illumination
light.n.07
the visual effect of illumination on objects or scenes as created in pictures
light.n.08
a person regarded very fondly
light.n.09
having abundant light or illumination
light.n.10
mental understanding as an enlightening experience
sparkle.n.01
merriment expressed by a brightness or gleam or animation of countenance
light.n.12
public awareness
inner_light.n.01
a divine presence believed by Quakers to enlighten and guide the soul
light.n.14
lighter.n.02
a device for lighting or igniting fuel or charges or fires
light.v.01
make lighter or brighter
light_up.v.05
begin to smoke
alight.v.01
to come to rest, settl

The other key organizing relationship in WordNet is the hypernym/hyponym relation, which is the *is a kind of* relation. Put abstractly, if A is a kind of B, then B is a hypernym of A, and A is a hyponym of B. Note that a synset can have multiple hyponyms and multiple hypernyms. However, multiple hypernyms are much rarer and we won't address them here, which allows us to treat WordNet like a tree.

For a more concrete example, a cat is an animal, so "cat" is a *hyponym* of "animal", and "animal" is a *hypernym* of "cat".
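To make the direction of the two relations concrete, here is a small sketch on a hand-written toy hierarchy. The dictionary `TOY_HYPERNYM` and both helper functions are hypothetical, and the chain is a simplification rather than the actual WordNet path.

```python
# Toy child -> parent (hypernym) links; a hand-written simplification,
# not the actual WordNet chain.
TOY_HYPERNYM = {
    "domestic_cat": "cat",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
}


def hypernym_chain(word, parent_of):
    """Follow parent links from word up to the root and return the path."""
    chain = [word]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def is_hyponym_of(word, candidate, parent_of):
    """True if candidate appears strictly above word in its chain."""
    return candidate in hypernym_chain(word, parent_of)[1:]


print(hypernym_chain("cat", TOY_HYPERNYM))
print(is_hyponym_of("cat", "animal", TOY_HYPERNYM))   # cat is a kind of animal
print(is_hyponym_of("animal", "cat", TOY_HYPERNYM))   # but not the reverse
```

Because each toy node has a single parent, the structure is a tree, which mirrors the simplifying assumption made above.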

Let's look at some of these relations in WordNet.

In [68]:
cat_synset = wn.synset("cat.n.01")

In [69]:
cat_synset.hypernyms()[0]

Synset('feline.n.01')

In [70]:
cat_synset.hypernyms()[0].hypernyms()[0]

Synset('carnivore.n.01')

In [71]:
cat_synset.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0]

Synset('whole.n.02')

In [72]:
cat_synset.hyponyms()[0]

Synset('domestic_cat.n.01')

If a synset does not have any hyponyms, `hyponyms()` returns an empty list:

In [73]:
cat_synset.hyponyms()[0].hyponyms()[0].hyponyms()

[]

In [74]:
money_synset = wn.synset("coin.n.01")

In [75]:
money_synset.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()[0]

Synset('entity.n.01')

In [76]:
money_synset.hyponyms()

[Synset('bawbee.n.01'),
 Synset('bezant.n.01'),
 Synset('change.n.08'),
 Synset('crown.n.06'),
 Synset('denier.n.02'),
 Synset('dime.n.01'),
 Synset('dollar.n.03'),
 Synset('double_eagle.n.02'),
 Synset('doubloon.n.01'),
 Synset('ducat.n.01'),
 Synset('eagle.n.03'),
 Synset('eightpence.n.01'),
 Synset('farthing.n.01'),
 Synset('fivepence.n.01'),
 Synset('fourpence.n.01'),
 Synset('guinea.n.01'),
 Synset('half_crown.n.01'),
 Synset('half_dollar.n.01'),
 Synset('half_eagle.n.01'),
 Synset('halfpenny.n.01'),
 Synset('louis_d'or.n.01'),
 Synset('maundy_money.n.01'),
 Synset('medallion.n.01'),
 Synset('nickel.n.02'),
 Synset('ninepence.n.01'),
 Synset('penny.n.02'),
 Synset('piece_of_eight.n.01'),
 Synset('quarter.n.10'),
 Synset('real.n.03'),
 Synset('shilling.n.06'),
 Synset('sixpence.n.01'),
 Synset('slug.n.03'),
 Synset('sou.n.01'),
 Synset('stater.n.01'),
 Synset('tenpence.n.01'),
 Synset('threepence.n.01'),
 Synset('twopence.n.01')]

Every noun synset in WordNet is connected to every other noun synset in WordNet via a path of hypernym/hyponym relationships. If you follow up the hypernyms of any noun far enough, you will reach the *entity.n.01* node. This connection allows for the calculation of similarity measures. The simplest of these, `path_similarity`, is the inverse of one plus the number of steps on the shortest path from one synset to the other in WordNet.

In [77]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
coin = wn.synset("coin.n.01")

In [78]:
dog.path_similarity(cat)

0.2

In [79]:
dog.path_similarity(coin)

0.058823529411764705
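The arithmetic behind these scores can be reproduced on a toy tree. This is a sketch under stated assumptions: the tree `TOY_PARENT` is hand-written (not the real WordNet graph), each node has one parent, and the score is taken to be 1 / (number of edges + 1), which agrees with the 0.2 above if dog and cat are four edges apart.

```python
# Hand-written toy hierarchy (child -> parent); not the real WordNet graph.
TOY_PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "coin": "money",
    "money": "entity",
}


def ancestors(node, parent_of):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def path_length(a, b, parent_of):
    """Count the edges from a to b through their nearest shared ancestor."""
    up_a = ancestors(a, parent_of)
    up_b = ancestors(b, parent_of)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("no common ancestor")


def toy_path_similarity(a, b, parent_of):
    """Inverse of (edge count + 1), so identical nodes score 1.0."""
    return 1 / (path_length(a, b, parent_of) + 1)


print(toy_path_similarity("dog", "cat", TOY_PARENT))   # 4 edges
print(toy_path_similarity("dog", "coin", TOY_PARENT))  # 6 edges
```

Since the toy tree is much shallower than WordNet, only the dog/cat value is expected to line up with the output above.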

Path similarity isn't a great measure because even though it does often get the basic relationships right, it doesn't make good use of its range (0, 1). In this case, it feels like dog and cat should be more similar than 0.2. 

And there's another serious problem. If we pick siblings higher in the hypernym tree, we find similarities larger than 0.2 between very distinct concepts, when such concepts should receive smaller similarities.

In [80]:
wn.synset('physical_entity.n.01').hyponyms()

[Synset('causal_agent.n.01'),
 Synset('matter.n.03'),
 Synset('object.n.01'),
 Synset('process.n.06'),
 Synset('substance.n.04'),
 Synset('thing.n.12')]

In [81]:
wn.synset('process.n.06').path_similarity(wn.synset('matter.n.03'))

0.3333333333333333

A more sophisticated measure, `wup_similarity`, addresses this problem by using depth instead of distance and features the lowest common subsumer (LCS), i.e. the nearest shared hypernym of two synsets; two synsets will have high similarity if their LCS is deep in the hierarchy. The equation is

$$
wup\_similarity(A,B) = \frac{2*depth(LCS(A,B))}{depth(A) + depth(B)}
$$

Let's draw some trees and see how this formula works.

In [82]:
dog.wup_similarity(cat)

0.8571428571428571

In [83]:
dog.wup_similarity(coin)

0.1111111111111111

In [84]:
wn.synset('process.n.06').wup_similarity(wn.synset('matter.n.03'))

0.6666666666666666
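The formula can also be worked through by hand on a toy tree. The sketch below is illustrative only: `TOY_PARENT` is a hand-written hierarchy, and depth is counted in nodes so that the root has depth 1, which is an assumption of this example. The resulting numbers therefore differ from the NLTK output above.

```python
# Hand-written toy hierarchy (child -> parent); not the real WordNet graph.
TOY_PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "coin": "money",
    "money": "entity",
}


def depth(node, parent_of):
    """Depth counted in nodes, so the root has depth 1 (an assumption here)."""
    d = 1
    while node in parent_of:
        node = parent_of[node]
        d += 1
    return d


def lowest_common_subsumer(a, b, parent_of):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    above_b = {b}
    node = b
    while node in parent_of:
        node = parent_of[node]
        above_b.add(node)
    node = a
    while node not in above_b:
        node = parent_of[node]
    return node


def toy_wup_similarity(a, b, parent_of):
    """2 * depth(LCS) / (depth(a) + depth(b)), as in the equation above."""
    lcs = lowest_common_subsumer(a, b, parent_of)
    return 2 * depth(lcs, parent_of) / (depth(a, parent_of) + depth(b, parent_of))


print(lowest_common_subsumer("dog", "cat", TOY_PARENT))  # carnivore, depth 3
print(toy_wup_similarity("dog", "cat", TOY_PARENT))      # 2*3 / (5+5)
print(toy_wup_similarity("dog", "coin", TOY_PARENT))     # 2*1 / (5+3)
```

The dog/coin pair scores low because the only shared ancestor is the root, which is exactly the behaviour the depth-based formula is meant to capture.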

There are even more sophisticated ways to calculate similarity using WordNet relations, though they require sense-tagged corpora. We can use the equation above, but instead of using the WordNet *depth* of A, B, and their LCS, we use their *information* based on their probability in a corpus.

Though the hyponym/hypernym relation forms the core of WordNet, it has several other relationships between synsets. Meronymy/holonymy is the *is part of* relation: if A is a part of B, then A is a meronym of B, and B is a holonym of A. WordNet distinguishes between `part_meronyms` and `substance_meronyms`.

"Wheel" is a *meronym* of "car", and "car" is a *holonym* of "wheel". Unlike in the hypernym/hyponym relationship, A does not belong to a smaller class than B; B actually contains A. This is sometimes also referred to as a *has a* relation: "A car has a wheel".

In [85]:
car = wn.synset('car.n.01')
# Parts that WordNet lists for a car.
car.part_meronyms()

[Synset('accelerator.n.01'),
 Synset('air_bag.n.01'),
 Synset('auto_accessory.n.01'),
 Synset('automobile_engine.n.01'),
 Synset('automobile_horn.n.01'),
 Synset('buffer.n.06'),
 Synset('bumper.n.02'),
 Synset('car_door.n.01'),
 Synset('car_mirror.n.01'),
 Synset('car_seat.n.01'),
 Synset('car_window.n.01'),
 Synset('fender.n.01'),
 Synset('first_gear.n.01'),
 Synset('floorboard.n.02'),
 Synset('gasoline_engine.n.01'),
 Synset('glove_compartment.n.01'),
 Synset('grille.n.02'),
 Synset('high_gear.n.01'),
 Synset('hood.n.09'),
 Synset('luggage_compartment.n.01'),
 Synset('rear_window.n.01'),
 Synset('reverse.n.02'),
 Synset('roof.n.02'),
 Synset('running_board.n.01'),
 Synset('stabilizer_bar.n.01'),
 Synset('sunroof.n.01'),
 Synset('tail_fin.n.02'),
 Synset('third_gear.n.01'),
 Synset('window.n.02')]

In [86]:
wood = wn.synset('wood.n.01')
wood.substance_holonyms()


[Synset('beam.n.02'),
 Synset('chopping_block.n.01'),
 Synset('lumber.n.01'),
 Synset('spindle.n.02')]

There are also antonyms in WordNet, though note that they are tied to particular lemmas rather than to synsets. Why? Unlike synonyms, which all mean roughly the same thing, antonyms are often specific to one word form. Consider "big/little brother": we can't say "small brother"! Antonyms are much more common with adjectives, though there are nouns with antonyms too.

In [87]:
abstract_syn = wn.synset('abstract.a.01')
abstract_syn.lemmas()

[Lemma('abstract.a.01.abstract')]

In [88]:
abstract_syn.lemmas()[0].antonyms()

[Lemma('concrete.a.01.concrete')]

In [89]:
wn.synset('sister.n.01').lemmas()[0].antonyms()

[Lemma('brother.n.01.brother')]

Finally, NLTK only includes the English WordNet, but WordNets for other languages exist; see [this site](http://globalwordnet.org/resources/wordnets-in-the-world/). NLTK does include non-English lemmas for English synsets, via the [Open Multilingual WordNet (OMW)](http://compling.hss.ntu.edu.sg/omw/). This provides a nice multilingual lexicon, and allows you to calculate things like WordNet distances for other languages using the English WordNet.

In [90]:
import nltk
nltk.download('omw')

[nltk_data] Downloading package omw to /Users/lxy/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

In [91]:
for lang in wn.langs():
    print("---" + lang + "----")
    
    for lemma in dog.lemma_names(lang):
        print(lemma)

---eng----
dog
domestic_dog
Canis_familiaris
---als----
---arb----
كلْب
---bul----
куче
---cat----
ca
canis_familiaris
gos
gos_domèstic
---cmn----
犬
狗
---dan----
hund
køter
vovhund
vovse
---ell----
σκύλος_γένους_Canis_familiaris
---eus----
or
txakur
zakur
---fas----
---fin----
Canis_familiaris
koira
---fra----
canis_familiaris
chien
---glg----
can
Canis_familiaris
---heb----
---hrv----
Canis_lupus_familiaris
domaći_pas
pas
---ind----
anjing
---ita----
cane
Canis_familiaris
---jpn----
イヌ
ドッグ
洋犬
犬
飼犬
飼い犬
---nld----
hond
joekel
---nno----
bisk
hund
kjøter
---nob----
bisk
hund
kjøter
---pol----
pies
pies_domowy
---por----
cachorra
cachorro
cadela
cão
---qcn----
---slv----
canis_familiaris
pes
---spa----
can
perro
---swe----
hund
---tha----
หมา
สุนัข
หมาบ้าน
---zsm----
anjing


## Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the task of determining which sense a particular instance of a word has. Given a collection of possible senses (synsets) for a particular word, we can view it as a classification task for each ambiguous word token.

WSD is a fundamental problem in computational linguistics, but it is actually rarely included explicitly in pipelines for major semantic tasks. One reason is that good, supervised WSD requires a separate model for *each ambiguous word type*, which is a major overhead even assuming you have reliable annotations of sense (which are expensive to create). But note that some of the sophisticated neural methods like BERT are clearly doing implicit WSD.

To get at another reason why WSD isn't a standard part of most NLP pipelines, let's look at some examples of the word "interest" from two corpora, Treebank and Gutenberg:

In [92]:
from nltk.corpus import treebank, gutenberg

def print_examples(corpus, target, max_examples=5):
    """Print up to max_examples sentences from corpus that contain target."""
    examples = []
    for sent in corpus.sents():
        # Compare case-insensitively; any() stops at the first matching token.
        if any(word.lower() == target for word in sent):
            examples.append(" ".join(sent))
        if len(examples) == max_examples:
            break
    print("\n".join(examples))

# Show the first five WordNet senses of the target word for reference.
for synset in wn.synsets("interest")[:5]:
    print(synset.name())
    print(synset.definition())

print("-----\nTreebank\n----")
print_examples(treebank, "interest")
print("-----\nGutenberg\n-----")
print_examples(gutenberg, "interest")



interest.n.01
a sense of concern with and curiosity about someone or something
sake.n.01
a reason for wanting something done
interest.n.03
the power of attracting or holding one's attention (because it is unusual or exciting etc.)
interest.n.04
a fixed charge for borrowing money; usually a percentage of the amount borrowed
interest.n.05
(law) a right or legal share of something; a financial involvement with something
-----
Treebank
----
Yields on money-market mutual funds continued *-1 to slide , amid signs that portfolio managers expect further declines in interest rates .
Longer maturities are thought *-1 to indicate declining interest rates because they permit portfolio managers to retain relatively higher rates for a longer period .
Nevertheless , said *T*-1 Brenda Malizia Negus , editor of Money Fund Report , yields `` may blip up again before they blip down '' because of recent rises in short-term interest rates .
J.P. Bolduc , vice chairman of W.R. Grace & Co. , which *T*-10 hol

In any particular corpus, one sense (or a small group of closely related senses) will often dominate. This means that it is often very difficult to beat the *most frequent sense* (MFS) baseline accuracy, i.e. the accuracy a system would get if it always guessed the most common sense.
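As a quick illustration of how the MFS baseline is computed, here is a minimal sketch on made-up labels. The sense names in `gold` are hypothetical placeholders, not WordNet identifiers, and `mfs_baseline` is a helper introduced only for this example.

```python
from collections import Counter


def mfs_baseline(gold_senses):
    """Return (most frequent sense, accuracy of always guessing it)."""
    counts = Counter(gold_senses)
    sense, freq = counts.most_common(1)[0]
    return sense, freq / len(gold_senses)


# Hypothetical gold labels for ten tokens of one ambiguous word.
gold = ["money"] * 6 + ["curiosity"] * 3 + ["legal_share"]

sense, accuracy = mfs_baseline(gold)
print(sense, accuracy)
```

Any classifier for this word would need to score above that accuracy on the same data to be worth its cost.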

NLTK includes a corpus called semcor which, as noted earlier, is a sense-tagged portion of the Brown corpus (it is also chunked, with senses assigned to chunks rather than to individual words). Let's take a look at it, and then use it to calculate the MFS accuracy for the word *interest* in that corpus.

In [93]:
from collections import defaultdict
import nltk
# nltk.download('semcor')
from nltk.corpus import semcor

print(semcor.tagged_sents(tag='sense')[0])

[Tree('DT', ['The']), Tree(Lemma('group.n.01.group'), [Tree('NE', [Tree('NNP', ['Fulton', 'County', 'Grand', 'Jury'])])]), Tree(Lemma('state.v.01.say'), [Tree('VB', ['said'])]), Tree(Lemma('friday.n.01.Friday'), [Tree('NN', ['Friday'])]), Tree('DT', ['an']), Tree(Lemma('probe.n.01.investigation'), [Tree('NN', ['investigation'])]), Tree('IN', ['of']), Tree(Lemma('atlanta.n.01.Atlanta'), [Tree('NN', ['Atlanta'])]), Tree('POS', ["'s"]), Tree(Lemma('late.s.03.recent'), [Tree('JJ', ['recent'])]), Tree(Lemma('primary.n.01.primary_election'), [Tree('NN', ['primary', 'election'])]), Tree(Lemma('produce.v.04.produce'), [Tree('VB', ['produced'])]), Tree(None, ['``']), Tree('DT', ['no']), Tree(Lemma('evidence.n.01.evidence'), [Tree('NN', ['evidence'])]), Tree(None, ["''"]), Tree('IN', ['that']), Tree('DT', ['any']), Tree(Lemma('abnormality.n.04.irregularity'), [Tree('NN', ['irregularities'])]), Tree(Lemma('happen.v.01.take_place'), [Tree('VB', ['took', 'place'])]), Tree(None, ['.'])]


In [94]:
wanted_word = "interest"
sense_counts = defaultdict(int)

for sent in semcor.tagged_sents(tag='sense'):
    for chunk in sent:
        lemma = chunk.label()
        try:
            lemma_name = lemma.name()
        except AttributeError:
            # Chunks without a sense tag are labelled with a string or None, not a Lemma.
            lemma_name = None
        if lemma_name == wanted_word:
            sense_counts[lemma.synset().name()] += 1
            # Report progress every 10 matches.
            if sum(sense_counts.values()) % 10 == 0:
                print(sum(sense_counts.values()))
                
print(sense_counts)
            

10
20
30
40
50
60
70
80
90
100
110
120
130
140


In [None]:
print(wn.synset('interest.n.01').definition())
print(max(sense_counts.values())/sum(sense_counts.values()))  

a sense of concern with and curiosity about someone or something
0.3904109589041096


Let's build a simple decision tree classifier which uses the words appearing immediately before and after as features, and see if we can beat this baseline.

In [None]:
len(semcor.tagged_sents(tag='hi'))

37176

In [None]:
semcor.tagged_sents(tag='hi')[0]

[Tree('DT', ['The']),
 Tree(Lemma('group.n.01.group'), [Tree('NE', [Tree('NNP', ['Fulton', 'County', 'Grand', 'Jury'])])]),
 Tree(Lemma('state.v.01.say'), [Tree('VB', ['said'])]),
 Tree(Lemma('friday.n.01.Friday'), [Tree('NN', ['Friday'])]),
 Tree('DT', ['an']),
 Tree(Lemma('probe.n.01.investigation'), [Tree('NN', ['investigation'])]),
 Tree('IN', ['of']),
 Tree(Lemma('atlanta.n.01.Atlanta'), [Tree('NN', ['Atlanta'])]),
 Tree('POS', ["'s"]),
 Tree(Lemma('late.s.03.recent'), [Tree('JJ', ['recent'])]),
 Tree(Lemma('primary.n.01.primary_election'), [Tree('NN', ['primary', 'election'])]),
 Tree(Lemma('produce.v.04.produce'), [Tree('VB', ['produced'])]),
 Tree(None, ['``']),
 Tree('DT', ['no']),
 Tree(Lemma('evidence.n.01.evidence'), [Tree('NN', ['evidence'])]),
 Tree(None, ["''"]),
 Tree('IN', ['that']),
 Tree('DT', ['any']),
 Tree(Lemma('abnormality.n.04.irregularity'), [Tree('NN', ['irregularities'])]),
 Tree(Lemma('happen.v.01.take_place'), [Tree('VB', ['took', 'place'])]),
 Tree(None, 

See this [documentation for semcor](https://www.nltk.org/_modules/nltk/corpus/reader/semcor.html) for more detail.

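`semcor.tagged_sents` returns each sentence as a list of chunk trees. Its `tag` argument is documented as accepting `'pos'`, `'sem'`, or `'both'`; the calls above pass other strings (`'sense'`, `'hi'`) and, as the identical output shows, still get both part-of-speech and sense labels back.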

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
import numpy as np

correct_senses = []
feature_dicts = []

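for sent in semcor.tagged_sents(tag='sense'):
    for i, chunk in enumerate(sent):
        lemma = chunk.label()
        try:
            lemma_name = lemma.name()
        except AttributeError:
            # Untagged chunks have a string or None as label, not a Lemma.
            lemma_name = None
        if lemma_name == wanted_word:
            correct_senses.append(lemma.synset().name())
            feature_dict = {}
            if i > 0:
                # Last word of the preceding chunk.
                feature_dict["BEFORE_" + sent[i-1].leaves()[-1].lower()] = 1
            if i < len(sent) - 1:
                # Last word of the following chunk; the bound avoids an IndexError
                # when the target is the final chunk of the sentence.
                feature_dict["AFTER_" + sent[i+1].leaves()[-1].lower()] = 1
            feature_dicts.append(feature_dict)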


vectorizer = DictVectorizer()
X = vectorizer.fit_transform(feature_dicts)
clf = DecisionTreeClassifier(max_depth=3)
print(np.mean(cross_val_score(clf,X,correct_senses,cv=5)))


0.5549425287356322




We could probably do better at this task if we include a larger context around the word, and generalize some of the features. 
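One way to widen the context and generalize the features is sketched below on a plain token list. The function `window_features` and its feature naming scheme are hypothetical choices for illustration; the output is a dictionary of the same general shape as the `feature_dict` used above.

```python
def window_features(tokens, index, window=2):
    """Build a feature dict from the words within `window` positions of tokens[index]."""
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        # Skip the target itself and positions outside the sentence.
        if offset == 0 or position < 0 or position >= len(tokens):
            continue
        word = tokens[position].lower()
        # Position-specific feature, e.g. "W-1_raised" for the word just before.
        features["W%+d_%s" % (offset, word)] = 1
        # Position-independent feature that generalizes across offsets.
        features["NEAR_" + word] = 1
    return features


tokens = ["Banks", "raised", "interest", "rates", "again", "."]
print(window_features(tokens, tokens.index("interest")))
```

The `NEAR_` features trade positional precision for coverage, which may help when there are only a few hundred training examples per word.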

A classic approach to WSD which doesn't rely on tagged examples in a corpus is the "unsupervised" **Lesk algorithm**, which instead uses the information in the glosses in WordNet. The simplest version of Lesk, which is included in NLTK, just picks the sense whose definition has the most word overlap with the context around the word.

In [None]:
treebank.tagged_sents()

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]

> See [Lesk algorithm document](https://www.nltk.org/howto/wsd.html) for more usage information.

In [None]:
# example
from nltk.wsd import lesk
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(lesk(sent, 'bank', 'n'))

Synset('savings_bank.n.02')


In [None]:
from nltk.wsd import lesk

target = "interest"

for sent in treebank.tagged_sents():
    for word, pos in sent:
        if word.lower() == target:
            print(" ".join(w for w, _ in sent))
            # Map the Penn Treebank tag to a WordNet POS via its first letter (e.g. NN -> n).
            synset = lesk([w.lower() for w, _ in sent], target, pos[0].lower())
            if synset is not None:  # lesk returns None when no sense is found
                print(synset.name())
                print(synset.definition())
            break


Yields on money-market mutual funds continued *-1 to slide , amid signs that portfolio managers expect further declines in interest rates .
pastime.n.01
a diversion that occupies one's time and thoughts (usually pleasantly)
Longer maturities are thought *-1 to indicate declining interest rates because they permit portfolio managers to retain relatively higher rates for a longer period .
sake.n.01
a reason for wanting something done
Nevertheless , said *T*-1 Brenda Malizia Negus , editor of Money Fund Report , yields `` may blip up again before they blip down '' because of recent rises in short-term interest rates .
interest.n.06
(usually plural) a social group whose members control some field of activity and who have common aims
J.P. Bolduc , vice chairman of W.R. Grace & Co. , which *T*-10 holds a 83.4 % interest in this energy-services company , was elected *-10 a director .
interest.n.06
(usually plural) a social group whose members control some field of activity and who have common

There are various problems with the simple Lesk. One issue is that function words can end up having a major effect; this can be mitigated by using a stopword list. Lesk works much better if you don't rely on direct word overlap but rather general semantic relatedness, for instance using the word representations we will see later.
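The stopword idea can be sketched with a hand-rolled overlap count. This is not NLTK's `lesk` implementation: `simple_lesk`, `content_words`, and the tiny `STOPWORDS` set are assumptions made for illustration, and the two glosses are copied from the WordNet output for "interest" shown earlier.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and", "or", "with", "about"}

# Glosses copied from the WordNet output for "interest" shown earlier.
GLOSSES = {
    "interest.n.01": "a sense of concern with and curiosity about someone or something",
    "interest.n.04": "a fixed charge for borrowing money; usually a percentage of the amount borrowed",
}


def content_words(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simple_lesk(context, glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    scores = {
        sense: len(context_words & content_words(gloss))
        for sense, gloss in glosses.items()
    }
    return max(scores, key=scores.get), scores


best, scores = simple_lesk("The bank lowered the interest charge for borrowing money", GLOSSES)
print(best, scores)
```

Without the stopword filter, words such as "the" and "for" would count toward the overlap and could tip the decision for reasons unrelated to meaning.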


In certain cases, it is possible to make use of unlabelled corpora for WSD, if you start with a bit of knowledge. If you know that "rates" is an unambiguous indicator of the money sense of "interest", you can use that as a jumping-off point to collect more examples of that sense and then build a supervised classifier from them. This is semi-supervised learning.

Let's count the words that appear near "interest rates" and "interest in" in the Brown corpus and look at them to see if they could be used to distinguish the sense of "interest".

In [None]:
from collections import Counter
cnt = Counter()
cnt.update(['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
)
cnt

Counter({'The': 1,
         'Fulton': 1,
         'County': 1,
         'Grand': 1,
         'Jury': 1,
         'said': 1,
         'Friday': 1,
         'an': 1,
         'investigation': 1,
         'of': 1,
         "Atlanta's": 1,
         'recent': 1,
         'primary': 1,
         'election': 1,
         'produced': 1,
         '``': 1,
         'no': 1,
         'evidence': 1,
         "''": 1,
         'that': 1,
         'any': 1,
         'irregularities': 1,
         'took': 1,
         'place': 1,
         '.': 1})

In [None]:
# Show what the sentences returned by brown.sents() look like
from nltk.corpus import brown
for sent in brown.sents()[:5]:
    print(sent)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']
['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']
['``', 'Only', 'a', 'relative', 'handful', 'of', 'such', 'reports

In [None]:
from collections import Counter
from nltk.corpus import brown

def get_context_around_words(word1,word2):
    '''given a pair of words, identify cases where the two words appear,
    and get other words in the sentence context'''
    context_counter = Counter()
    for sent in brown.sents():
        
        for i in range(len(sent) - 1):
            if sent[i].lower() == word1 and sent[i + 1].lower() == word2:
                context_counter.update(sent[:i])
                context_counter.update(sent[i+2:])
        
    return context_counter


In [None]:
print(get_context_around_words("interest","in"))

Counter({',': 122, 'the': 113, 'of': 74, '.': 66, 'and': 57, 'to': 51, 'a': 41, 'in': 38, 'that': 27, 'is': 22, 'their': 22, ';': 22, 'for': 20, 'have': 16, 'his': 16, 'was': 14, 'an': 14, '``': 13, 'he': 13, "''": 12, 'which': 12, 'by': 11, 'The': 11, 'who': 10, 'with': 10, 'her': 10, 'been': 10, 'on': 9, 'be': 9, 'or': 9, 'this': 8, 'from': 8, 'has': 8, 'had': 8, 'are': 7, 'it': 7, 'not': 6, 'interest': 6, 'about': 5, 'at': 5, 'as': 5, 'old': 5, '(': 5, ')': 5, 'many': 5, 'His': 5, 'fact': 5, 'would': 5, 'school': 5, 'own': 5, 'they': 5, 'out': 4, 'A': 4, 'will': 4, 'than': 4, 'new': 4, 'him': 4, 'no': 4, 'I': 4, 'any': 4, 'one': 4, 'He': 4, 'public': 4, 'jazz': 4, '--': 4, 'In': 4, 'up': 4, 'county': 3, 'others': 3, 'later': 3, 'even': 3, 'those': 3, 'between': 3, 'State': 3, 'other': 3, 'political': 3, 'people': 3, 'enough': 3, 'take': 3, 'much': 3, 'American': 3, 'how': 3, 'personal': 3, 'must': 3, 'Massachusetts': 3, 'men': 3, 'business': 3, 'affairs': 3, 'but': 3, 'often': 3, 'l

In [None]:
print(get_context_around_words("interest","rates"))

Counter({'the': 49, ',': 34, 'of': 21, 'to': 20, 'and': 19, 'in': 16, '.': 14, 'will': 11, 'as': 10, 'a': 9, 'that': 8, 'market': 6, 'during': 6, 'which': 5, 'be': 5, 'is': 5, 'for': 5, '1961': 5, 'capital': 5, 'Federal': 5, 'business': 4, 'on': 4, 'It': 4, 'spring': 4, 'may': 4, 'its': 4, '?': 4, '``': 3, "''": 3, 'months': 3, 'years': 3, 'run': 3, 'about': 3, 'are': 3, 'with': 3, 'question': 3, 'general': 3, 'monetary': 3, 'policies': 3, 'next': 3, 'long-term': 3, 'lower': 3, 'move': 3, 'rates': 3, 'bonds': 3, 'ease': 3, 'interest': 3, 'forces': 3, 'decline': 3, 'In': 2, 'believe': 2, 'by': 2, 'high': 2, 'has': 2, 'range': 2, 'down': 2, 'Administration': 2, 'activity': 2, 'By': 2, 'demand': 2, 'other': 2, 'several': 2, 'through': 2, 'moderately': 2, 'tend': 2, 'rate': 2, 'mortgage': 2, 'lending': 2, 'residential': 2, 'year': 2, 'terms': 2, 'extent': 2, 'would': 2, 'conduct': 2, 'open': 2, 'operations': 2, 'Government': 2, 'authorities': 2, 'their': 2, 'credit': 2, 'recovery': 2, 'My'

We could use this to find other (nearly) unambiguous cases, collecting enough examples of each that we could build a classifier to deal with more borderline cases.
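As a rough illustration of that idea, here is a minimal sketch that builds one context profile per seed collocation and then assigns a new sentence to the sense whose profile it overlaps with most. The sentences in `toy_sents`, the sense labels, and the helper names `collect_context` and `guess_sense` are all made up for this example; a real system would use `brown.sents()` and a proper classifier rather than a raw overlap score.

```python
from collections import Counter

# Hypothetical, hand-made sentences standing in for brown.sents()
toy_sents = [
    ["The", "bank", "raised", "interest", "rates", "on", "bonds", "."],
    ["Lower", "interest", "rates", "help", "the", "mortgage", "market", "."],
    ["She", "showed", "great", "interest", "in", "jazz", "music", "."],
    ["His", "interest", "in", "school", "politics", "grew", "."],
]

def collect_context(sents, word1, word2):
    """Count the (lowercased) words that co-occur with the bigram word1 word2."""
    context = Counter()
    for sent in sents:
        tokens = [tok.lower() for tok in sent]
        for i in range(len(tokens) - 1):
            if tokens[i] == word1 and tokens[i + 1] == word2:
                context.update(tokens[:i])
                context.update(tokens[i + 2:])
    return context

# One context profile per (nearly) unambiguous collocation; labels are invented
seed_profiles = {
    "financial": collect_context(toy_sents, "interest", "rates"),
    "curiosity": collect_context(toy_sents, "interest", "in"),
}

def guess_sense(sent, profiles):
    """Pick the sense whose seed contexts overlap most with the sentence."""
    tokens = [tok.lower() for tok in sent]
    scores = {}
    for sense, profile in profiles.items():
        total = sum(profile.values())
        # sum the relative frequency of each sentence token in this profile
        scores[sense] = sum(profile[tok] for tok in tokens) / total
    return max(scores, key=scores.get)

print(guess_sense(["The", "interest", "on", "mortgage", "bonds", "fell"], seed_profiles))
```

Function words such as "the" dominate the profiles, as the counters above show, so in practice it would probably help to remove stopwords or weight words by how distinctive they are.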

For very distinct senses, it is possible to cluster the different senses of a word given only a large corpus, which is an unsupervised task. This is known as Word Sense Induction.

Note that, like most unsupervised tasks, this would not label the classes for us: an expert would need to investigate the clusters and decide, for example, "these are news words, these are sports words", and so on.

**And here is the reality of all machine learning tasks: they require some manual work at some point.**

<br>

<br>

## Content Analysis

Content analysis is a methodology used primarily in the social sciences for studying text documents. Typically, it is little more than counting words using a large, hand-built lexicon in which each word is tagged for various properties. The [General Inquirer](http://www.wjh.harvard.edu/~inquirer/) is a classic example of this, and is publicly available. Let's use it to do a simple content analysis of individual genres of the Brown corpus.

In [None]:
import pandas as pd

df = pd.read_csv("inquireraugmented.tsv",sep="\t")
df = df.fillna('')



In [None]:
df.head(10)

Unnamed: 0,Entry,Source,Positiv,Negativ,Pstv,Affil,Ngtv,Hostile,Strong,Power,...,Anomie,NegAff,PosAff,SureLw,If,NotLw,TimeSpc,FormLw,Othrtags,Defined
0,,,1915,2291,1045.0,557,1160,833,1902,689.0,...,30.0,193.0,126.0,175.0,132.0,25.0,428.0,368.0,,
1,A,H4Lvd,,,,,,,,,...,,,,,,,,,DET ART,| article: Indefinite singular article--some o...
2,ABANDON,H4Lvd,,Negativ,,,Ngtv,,,,...,,,,,,,,,SUPV,|
3,ABANDONMENT,H4,,Negativ,,,,,,,...,,,,,,,,,Noun,|
4,ABATE,H4Lvd,,Negativ,,,,,,,...,,,,,,,,,SUPV,|
5,ABATEMENT,Lvd,,,,,,,,,...,,,,,,,,,Noun,
6,ABDICATE,H4,,Negativ,,,,,,,...,,,,,,,,,SUPV,|
7,ABHOR,H4,,Negativ,,,,Hostile,,,...,,,,,,,,,SUPV,|
8,ABIDE,H4,Positiv,,,Affil,,,,,...,,,,,,,,,SUPV,|
9,ABILITY,H4Lvd,Positiv,,,,,,Strong,,...,,,,,,,,,Noun,


Let's take a closer look at the different features the words can have. There is an explanation of these features [here](http://www.wjh.harvard.edu/~inquirer/homecat.htm). Note that there are some redundancies, because the GI is actually an amalgamation of earlier, separate lexicons.

In [None]:
print(list(df.columns.values))

['Entry', 'Source', 'Positiv', 'Negativ', 'Pstv', 'Affil', 'Ngtv', 'Hostile', 'Strong', 'Power', 'Weak', 'Submit', 'Active', 'Passive', 'Pleasur', 'Pain', 'Feel', 'Arousal', 'EMOT', 'Virtue', 'Vice', 'Ovrst', 'Undrst', 'Academ', 'Doctrin', 'Econ@', 'Exch', 'ECON', 'Exprsv', 'Legal', 'Milit', 'Polit@', 'POLIT', 'Relig', 'Role', 'COLL', 'Work', 'Ritual', 'SocRel', 'Race', 'Kin@', 'MALE', 'Female', 'Nonadlt', 'HU', 'ANI', 'PLACE', 'Social', 'Region', 'Route', 'Aquatic', 'Land', 'Sky', 'Object', 'Tool', 'Food', 'Vehicle', 'BldgPt', 'ComnObj', 'NatObj', 'BodyPt', 'ComForm', 'COM', 'Say', 'Need', 'Goal', 'Try', 'Means', 'Persist', 'Complet', 'Fail', 'NatrPro', 'Begin', 'Vary', 'Increas', 'Decreas', 'Finish', 'Stay', 'Rise', 'Exert', 'Fetch', 'Travel', 'Fall', 'Think', 'Know', 'Causal', 'Ought', 'Perceiv', 'Compare', 'Eval@', 'EVAL', 'Solve', 'Abs@', 'ABS', 'Quality', 'Quan', 'NUMB', 'ORD', 'CARD', 'FREQ', 'DIST', 'Time@', 'TIME', 'Space', 'POS', 'DIM', 'Rel', 'COLOR', 'Self', 'Our', 'You', '

Let's convert this to something a bit easier to work with (and look at), a Python dictionary which contains a set of applicable categories for each word.

> Note: `csv.reader()` yields each line of the file as a list of strings, one row per line, like this:
['Entry', ..., 'SklAsth', 'SklPt', 'SklOth', 'SklTOT', 'TrnGain', 'TrnLoss', 'TranLw', 'MeansLw', 'EndsLw', 'ArenaLw', 'PtLw', 'Nation', 'Anomie', 'NegAff', 'PosAff', 'SureLw', 'If', 'NotLw', 'TimeSpc', 'FormLw', 'Othrtags', 'Defined'],
['', '', '1915', '2291', ... '368', '', ''],
['A', 'H4Lvd', ..., 'DET ART', '| article: Indefinite singular article--some or any one']

In [None]:
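from collections import defaultdict
import csv

# map each lowercased entry to the set of GI categories it is tagged with
GI_lexicon = defaultdict(set)

with open("inquireraugmented.tsv") as f:
    reader = csv.reader(f, delimiter='\t')
    for line in reader:
        # drop any "#..." suffix from the entry and lowercase it
        word = line[0].split("#")[0].lower()
        # columns 2-183 hold the category tags; an empty cell means "not tagged"
        for i in range(2, 184):
            if line[i].strip():
                GI_lexicon[word].add(line[i])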
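Judging from the rows shown in the note above, a plain `csv.reader` loop also reads the header line and the line of column totals as if they were entries. A possible variant, sketched here on a tiny made-up excerpt, uses `csv.DictReader` so that the header becomes the column names and rows without an entry are skipped. The data in `toy_tsv`, the function name `build_lexicon`, and the choice of which columns to skip are assumptions for illustration, not part of the original notebook.

```python
import csv
import io
from collections import defaultdict

# Hypothetical four-column excerpt laid out like the GI file
toy_tsv = (
    "Entry\tSource\tPositiv\tNegativ\n"
    "\t\t1915\t2291\n"
    "ABANDON\tH4Lvd\t\tNegativ\n"
    "ABIDE\tH4\tPositiv\t\n"
    "ABOUT#1\tH4Lvd\t\t\n"
)

def build_lexicon(handle, skip_columns=("Entry", "Source", "Othrtags", "Defined")):
    """Map each lowercased entry to the set of category tags in its row."""
    lexicon = defaultdict(set)
    reader = csv.DictReader(handle, delimiter="\t")  # first line becomes the keys
    for row in reader:
        word = row["Entry"].split("#")[0].lower()
        if not word:
            continue  # skip the row of column totals, which has no entry
        for column, value in row.items():
            if column in skip_columns:
                continue
            if isinstance(value, str) and value.strip():
                lexicon[word].add(value)
    return lexicon

lexicon = build_lexicon(io.StringIO(toy_tsv))
print(dict(lexicon))
```

With the real file, the same function could be called on an open file handle instead of the `io.StringIO` wrapper.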



In [None]:
GI_lexicon["politician"]

{'HU', 'POLIT', 'Polit@', 'PowAuPt', 'PowTot', 'Power', 'Role', 'Strong'}

In [None]:
GI_lexicon["happy"]

{'EMOT', 'Pleasur', 'PosAff', 'Positiv', 'Pstv', 'WlbPsyc', 'WlbTot'}

In [None]:
GI_lexicon["sad"]

{'EMOT', 'Negativ', 'Ngtv', 'Pain', 'Passive', 'WlbPsyc', 'WlbTot'}

Next, we need a function that counts all the categories for a particular corpus (a list of words).

In [None]:
from collections import Counter

def get_GI_counts(words):
    '''return a Counter with the number of occurrences of each GI category
    in a list of words, and the total number of words'''
    counts = Counter()
    total_words = 0
    for word in words:
        total_words += 1
        word = word.lower()
        counts.update(GI_lexicon.get(word,[]))  # if word is not in the lexicon, add nothing
    return counts,total_words
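To see what this kind of function returns, here is a self-contained sketch that takes the lexicon as a parameter instead of reading the global `GI_lexicon`. The two entries in `toy_lexicon` are copied from the `"happy"` and `"sad"` outputs shown earlier; the name `count_categories` and the example sentence are made up for illustration.

```python
from collections import Counter

# Entries copied from the outputs shown earlier for "happy" and "sad"
toy_lexicon = {
    "happy": {"EMOT", "Pleasur", "PosAff", "Positiv", "Pstv", "WlbPsyc", "WlbTot"},
    "sad": {"EMOT", "Negativ", "Ngtv", "Pain", "Passive", "WlbPsyc", "WlbTot"},
}

def count_categories(words, lexicon):
    """Count lexicon categories over a list of tokens; also return the token count."""
    counts = Counter()
    total_words = 0
    for word in words:
        total_words += 1
        # every category of a matching word is incremented once per token
        counts.update(lexicon.get(word.lower(), ()))
    return counts, total_words

counts, total = count_categories(["Happy", "people", "are", "rarely", "sad", "."], toy_lexicon)
print(counts["EMOT"], counts["Positiv"], total)  # 2 1 6
```

Both "Happy" and "sad" carry the `EMOT` tag, so that category is counted twice, while punctuation and unknown words only contribute to the total.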

We can compare individual features across corpora by normalizing the counts. Let's look at features like "EMOT" and "POLIT", which should show a preference for one genre or the other.

In [None]:
from nltk.corpus import brown

fict_counts,fict_total = get_GI_counts(brown.words(categories="fiction"))
gov_counts,gov_total = get_GI_counts(brown.words(categories="government"))

In [None]:
fict_counts["EMOT"]/fict_total

0.010264571895806564

In [None]:
gov_counts["EMOT"]/gov_total

0.004121682330961108

In [None]:
fict_counts["POLIT"]/fict_total

0.017930148347155707

In [None]:
gov_counts["POLIT"]/gov_total

0.05175635010054623

These look fairly distinct, but how do we know whether the differences we are seeing actually mean something? Statistical testing! Let's use the counts directly in a [$\chi^2$ test](https://en.wikipedia.org/wiki/Chi-squared_test). 

Just for review, the $\chi^2$ test can be used to test whether it is likely that the same underlying statistical process generated two or more categories of count data. It is one of the most useful statistical tests for computational linguistics applications! To use it we simply need to construct a contingency table of the following form:

| Count type            | Document set 1 | Document set 2 |
|-----------------------|----------------|----------------|
| tokens in lexicon     | 1228           | 3629           |
| tokens not in lexicon | 67260          | 66488          |

Let's fill out this table manually using an example from above.

In [None]:
fict_counts["POLIT"]

1228

In [None]:
fict_total - fict_counts["POLIT"]

67260

In [None]:
gov_counts["POLIT"]

3629

In [None]:
gov_total - gov_counts["POLIT"]

66488
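With the four cells filled in, the statistic itself can be worked out by hand. The sketch below uses the standard shortcut formula for a 2x2 table and, because such a table has one degree of freedom, gets the p-value from `math.erfc`. The function name `chi2_2x2` is made up for this example. Yates' continuity correction is switched on by default here because scipy's `chi2_contingency` also applies it to 2x2 tables by default, so the result should land close to what scipy reports for the same table.

```python
import math

def chi2_2x2(count1, total1, count2, total2, yates=True):
    """Chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    a, b = count1, count2                      # tokens in lexicon
    c, d = total1 - count1, total2 - count2    # tokens not in lexicon
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks the difference by n / 2
        diff = max(0.0, diff - n / 2)
    # shortcut formula: n * (ad - bc)^2 divided by the product of the margins
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # with 1 degree of freedom, the upper tail probability is erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# the POLIT counts for fiction and government from the cells above
stat, p_value = chi2_2x2(1228, 1228 + 67260, 3629, 3629 + 66488)
print(stat, p_value)
```

The sketch assumes that no row or column total is zero; a category that never occurs in either collection would cause a division by zero.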

Given a lexicon count and a total for each of two categories, the function below uses scipy's [chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) to construct the table and get a p-value.

In [None]:
from scipy.stats import chi2_contingency

def get_p_value(count1,total1,count2,total2):
    return chi2_contingency([[count1, count2],[total1-count1,total2-count2]])[1]

In [None]:
get_p_value(fict_counts["POLIT"],fict_total,gov_counts["POLIT"],gov_total)

1.0131730384029113e-256

In [None]:
get_p_value(50, 100, 2, 101)

2.7019248198799348e-14

Yup, that is statistically significant! Note that $\chi^2$ doesn't tell us anything about the direction of the relation; we would need to look at the original ratios for that.
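One simple way to recover the direction is to divide the two relative frequencies. The helper below is a hypothetical addition (the name `rate_ratio` is not from the notebook), shown with the POLIT counts for fiction and government from above.

```python
def rate_ratio(count1, total1, count2, total2):
    """Ratio of the two relative frequencies.

    A value above 1 means the category is more frequent in collection 1,
    a value below 1 means it is more frequent in collection 2.
    """
    return (count1 / total1) / (count2 / total2)

# POLIT in fiction (1228 of 68488 tokens) vs. government (3629 of 70117 tokens)
ratio = rate_ratio(1228, 68488, 3629, 70117)
print(ratio)
```

The ratio is well below 1, which matches the normalized counts seen earlier: POLIT words are far more common in the government texts. The helper assumes the category occurs at least once in the second collection.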

Finally, let's go searching for lexical classes with low p-values. Note that when we do many tests like this, we should apply the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) or something similar, because the normal p < 0.05 cutoff for statistical significance is no longer valid. This won't usually be a problem here, though, since the p-values we care about are far below any corrected cutoff.
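The Bonferroni correction simply divides the significance level by the number of tests. A minimal sketch, with a made-up function name and invented p-values for two placeholder categories:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Keep only the features whose p-value is below alpha / number_of_tests."""
    cutoff = alpha / len(p_values)
    significant = {feature: p for feature, p in p_values.items() if p < cutoff}
    return significant, cutoff

# Hypothetical p-values; "FeatureA" and "FeatureB" are placeholders, not GI categories
example_p_values = {"POLIT": 1e-256, "FeatureA": 0.02, "FeatureB": 0.2}

significant, cutoff = bonferroni_significant(example_p_values)
print(cutoff)
print(sorted(significant))
```

Here `FeatureA` would pass the usual 0.05 cutoff but not the corrected one, while a p-value as small as the one for POLIT passes either way.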

In [None]:
def find_highest_ps(counts1,counts2,total1,total2):
    '''given lexicon counts for two document collections, print the 25
    lexicon categories that best distinguish the two collections, i.e. those
    with the lowest chi-square p-values (despite the function name)'''
    to_sort = []
    all_features = set(counts1)
    all_features.update(counts2)
    for feature in all_features:
        to_sort.append((get_p_value(counts1.get(feature,0),total1,counts2.get(feature,0), total2),feature))
    to_sort.sort()
    print(to_sort[:25])


In [None]:
find_highest_ps(fict_counts,gov_counts,fict_total,gov_total)

[(0.0, 'ECON'), (0.0, 'MALE'), (1.0131730384029113e-256, 'POLIT'), (1.0899851560042814e-230, 'Female'), (8.215329763085285e-186, 'WltTot'), (8.559042732884123e-177, 'Econ@'), (3.325303663333495e-170, 'Doctrin'), (3.2405878752386136e-156, 'WltOth'), (1.6231680491880572e-154, 'PowTot'), (1.295343865383213e-107, 'Self'), (2.490830178891491e-103, 'Strong'), (5.517322897976327e-96, 'ABS'), (1.2530683111938407e-92, 'COLL'), (1.8444080804326904e-88, 'BodyPt'), (1.37755034301735e-70, 'Abs@'), (3.475313128766613e-70, 'DAV'), (5.554324680658735e-69, 'PowOth'), (1.52434986952163e-66, 'EndsLw'), (1.5864221396139325e-63, 'Know'), (5.757380211858323e-62, 'Means'), (3.1548427300660787e-60, 'Polit@'), (4.6389068590393685e-60, 'PowAuPt'), (1.542283985697429e-59, 'Intrj'), (4.493380189728371e-56, 'Virtue'), (1.4062380296799672e-52, 'Say')]


In [None]:
fict_counts["MALE"]/fict_total

0.04304403691157575

In [None]:
gov_counts["MALE"]/gov_total

0.00740191394383673

A more modern, more popular lexicon for content analysis is the [Linguistic Inquiry and Word Count (LIWC, pronounced Luke)](http://liwc.wpengine.com/). LIWC is a commercial product and so we won't be distributing it. If you are serious about content analysis, LIWC is the gold standard among general, off-the-shelf approaches.

Let's pick another pair of Brown genres, investigate what GI lexical features distinguish them, and come up with a rationale as to why.

In [None]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [None]:
news_counts, news_total = get_GI_counts(brown.words(categories="news"))
religion_counts, religion_total = get_GI_counts(brown.words(categories="religion"))

find_highest_ps(news_counts,religion_counts, news_total,religion_total)

[(0.0, 'Relig'), (3.85823625266287e-233, 'RcTot'), (6.079184235101566e-213, 'RcRelig'), (7.076536193882808e-94, 'PowAuPt'), (3.9375763748424565e-93, 'Our'), (1.4683528136143864e-92, 'RcEnds'), (1.150379509654069e-65, 'ECON'), (2.257597059311751e-65, 'PowTot'), (4.6061510497257784e-63, 'Econ@'), (3.117469298855974e-62, 'POLIT'), (8.908051704090205e-60, 'Ovrst'), (5.291024512891382e-59, 'Abs@'), (8.043785847291017e-54, 'You'), (3.2851729794897512e-49, 'WltTot'), (2.622544286528954e-48, 'Know'), (2.5573863934419464e-45, 'Polit@'), (4.8540712883023225e-43, 'Passive'), (7.864775080281807e-38, 'Female'), (3.2282757638165437e-36, 'Virtue'), (2.5010186185960975e-35, 'NotLw'), (8.542651243659839e-35, 'WltOth'), (7.824272680956149e-34, 'Intrj'), (1.2648804358134442e-33, 'NatrPro'), (5.740638841422914e-33, 'Negate'), (8.505449903030259e-32, 'Undrst')]


In [None]:
news_counts["Virtue"]/news_total

0.023340692563199872

In [None]:
religion_counts["Virtue"]/religion_total

0.03540699002512754

# Review:

1. WordNet is a network of *synsets*, not words.
2. Each synset in WordNet is associated with one or more lemmas.
3. "I stopped at the bank before shopping" and "my bank account is empty" involve two different senses of the word "bank".
4. To calculate Wu-Palmer similarity, you need to identify the lowest common subsumer of the two synsets you want to compare.
5. Foot is not a hyponym of leg, but a *meronym* of leg (because a foot is a part of a leg).
6. Word sense disambiguation is an important problem in natural language understanding.
7. If a word has two senses, one of which appears 8 times and the other 2 times, the accuracy of the most frequent sense baseline is 0.8.
8. The Lesk algorithm involves comparing the definition of each sense with the context around the occurrence of the word.
9. To use $\chi^2$ in the context of content analysis, we need the counts for one lexicon category in two different document collections, together with the total number of tokens in each.