<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/11.nlp/ExploreWordNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/11.nlp/ExploreWordNet.ipynb)

# Exploring WordNet

This notebook explores WordNet synsets and presents a simple method for finding every mention in a text of any hyponym of a given node in the WordNet hierarchy (e.g., finding all *buildings* in a text).  Upload this notebook at the end of class.

In [1]:
import re

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Get the synsets for a given word.  The synsets here are roughly ordered by frequency of use (in a small tagged dataset), so that more frequent senses occur first.

In [2]:
for synset in wn.synsets('speak'):
    print (synset, synset.definition())

Synset('talk.v.02') express in speech
Synset('talk.v.01') exchange thoughts; talk with
Synset('speak.v.03') use language
Synset('address.v.02') give a speech to
Synset('speak.v.05') make a characteristic or natural sound


Get the words/phrases (lemmas) in a given synset.

In [3]:
for lemma in wn.synset("talk.v.01").lemmas():
    print (lemma.name())

talk
speak


In [4]:
# Helper functions from http://www.nltk.org/howto/wordnet.html; each returns a synset's *direct*
# hyponyms/hypernyms, and passing one to Synset.closure() retrieves *all* of them
hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

Find all of the synsets that are hyponyms of the target synset (*descendants* in the WordNet hierarchy)

In [5]:
list(wn.synset("talk.v.02").closure(hypo))

[Synset('cackle.v.01'),
 Synset('chatter.v.05'),
 Synset('speak_up.v.02'),
 Synset('rasp.v.02'),
 Synset('tone.v.02'),
 Synset('generalize.v.02'),
 Synset('slur.v.03'),
 Synset('whisper.v.01'),
 Synset('yack.v.01'),
 Synset('babble.v.01'),
 Synset('rant.v.01'),
 Synset('hiss.v.03'),
 Synset('mumble.v.01'),
 Synset('bark.v.01'),
 Synset('verbalize.v.01'),
 Synset('drone.v.02'),
 Synset('read.v.03'),
 Synset('deliver.v.01'),
 Synset('gulp.v.02'),
 Synset('shout.v.01'),
 Synset('blubber.v.02'),
 Synset('blurt_out.v.01'),
 Synset('snivel.v.01'),
 Synset('whiff.v.05'),
 Synset('snap.v.01'),
 Synset('bumble.v.03'),
 Synset('sing.v.02'),
 Synset('murmur.v.01'),
 Synset('open_up.v.07'),
 Synset('chatter.v.04'),
 Synset('troll.v.07'),
 Synset('lip_off.v.01'),
 Synset('peep.v.04'),
 Synset('enthuse.v.02'),
 Synset('speak_in_tongues.v.01'),
 Synset('swallow.v.04'),
 Synset('tone.v.01'),
 Synset('begin.v.04'),
 Synset('talk_of.v.01'),
 Synset('bay.v.01'),
 Synset('vocalize.v.05'),
 Synset('call.v.
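
To make the idea of a transitive closure concrete, here is a toy sketch that runs without WordNet. It is an illustration only, not NLTK's implementation: `toy_closure` and the miniature `toy_hyponyms` hierarchy are made up for this example. Like `closure(hypo)`, it repeatedly applies a "direct children" function and collects every descendant once.

```python
from collections import deque

def toy_closure(start, neighbors):
    """Yield every node reachable from `start` via `neighbors`, breadth-first,
    visiting each node once (the start node itself is not included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in neighbors(node):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                yield child

# A made-up miniature hierarchy (hypothetical, not real WordNet data).
toy_hyponyms = {
    "talk": ["whisper", "shout"],
    "shout": ["yell"],
    "whisper": [],
    "yell": [],
}

descendants = list(toy_closure("talk", lambda node: toy_hyponyms.get(node, [])))
print(descendants)  # ['whisper', 'shout', 'yell']
```

Note that the starting node is not part of its own closure, which is why `get_words_in_hypo` below appends the target synset explicitly.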

Find all of the synsets that are hypernyms (*ancestors* up the tree) of the target synset

In [6]:
list(wn.synset("communicate.v.02").closure(hyper))

[Synset('interact.v.01'), Synset('act.v.01')]

In [7]:
def get_words_in_hypo(synset):
    """ Returns the set of words/phrases (lemma names) in a synset and all of its hyponyms.
    """
    words=set()
    hyponym_synsets=list(synset.closure(hypo))
    hyponym_synsets.append(synset)
    for hyponym_synset in hyponym_synsets:
        for l in hyponym_synset.lemmas():
            # WordNet joins multi-word lemmas with underscores (e.g. "burnt_sienna")
            word=l.name().replace("_", " ")
            words.add(word)

    return words

In [8]:
get_words_in_hypo(wn.synset("color.n.01"))

{"Davy's gray",
 "Davy's grey",
 'Indian red',
 'Paris green',
 'Prussian blue',
 'Turkey red',
 'Tyrian purple',
 'Vandyke brown',
 'Venetian red',
 'achromasia',
 'achromatic color',
 'achromatic colour',
 'alabaster',
 'alizarine red',
 'amber',
 'apatetic coloration',
 'aposematic coloration',
 'apricot',
 'aqua',
 'aquamarine',
 'ash gray',
 'ash grey',
 'azure',
 'beige',
 'black',
 'blackness',
 'bleach',
 'blond',
 'blonde',
 'blondness',
 'blue',
 'blue green',
 'blueness',
 'bluish green',
 'bone',
 'bottle green',
 'brick red',
 'brown',
 'brownish yellow',
 'brownness',
 'buff',
 'burgundy',
 'burnt sienna',
 'burnt umber',
 'canary',
 'canary yellow',
 'caramel',
 'caramel brown',
 'cardinal',
 'carmine',
 'carnation',
 'cerise',
 'cerulean',
 'chalk',
 'charcoal',
 'charcoal gray',
 'charcoal grey',
 'chartreuse',
 'cherry',
 'cherry red',
 'chestnut',
 'chocolate',
 'chromatic color',
 'chromatic colour',
 'chromatism',
 'chrome green',
 'chrome red',
 'claret',
 'coal b

In [9]:
def find_all_words_in_text(words, spacy_tokens):
    """ For a given set of words, find each instance among a list of tokens already
    processed by spacy.  Returns a list of token indexes that match.  (Note this only
    identifies single words, not multi-word phrases.)
    """
    all_matches=[]
    for idx, token in enumerate(spacy_tokens):
        if token.lemma_ in words:
            all_matches.append(idx)
    return all_matches
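
The function above skips multi-word entries such as "burnt sienna". One way to pick them up, sketched here under assumptions, is to compare runs of consecutive lemmas against the target set. `find_phrases_in_lemmas` and its `max_len` parameter are hypothetical names introduced for this example; the input is a plain list of lemma strings, so the sketch runs without spaCy.

```python
def find_phrases_in_lemmas(phrases, lemmas, max_len=4):
    """Return (start, end) index pairs where a run of consecutive lemmas,
    joined by single spaces, equals one of the target phrases.
    At each position the longest match wins, and matches do not overlap."""
    matches = []
    i = 0
    while i < len(lemmas):
        matched = False
        # Try the longest candidate first, then shrink down to a single word.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            candidate = " ".join(lemmas[i:i + n])
            if candidate in phrases:
                matches.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return matches

targets = {"red", "burnt sienna", "charcoal gray"}
lemmas = ["a", "burnt", "sienna", "wall", "and", "a", "red", "door"]
phrase_matches = find_phrases_in_lemmas(targets, lemmas)
print(phrase_matches)  # [(1, 3), (6, 7)]
```

With a spaCy document, the lemma list could be built as `[token.lemma_ for token in spacy_tokens]`. Entries that the tokenizer splits differently from WordNet (possessives, for instance) may still be missed.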

In [10]:
def print_concordance(matches, spacy_tokens, window=3):
    """ For a given set of token indexes, prints out a window of words around each match,
    in the style of a concordance.
    """

    RED="\x1b[31m"    # ANSI escape code: switch the text color to red
    RESET="\x1b[0m"   # ANSI escape code: reset to the terminal's default style

    spacing=window*10
    for match in matches:
        # clamp the window to the bounds of the document
        start=max(match-window, 0)
        end=min(match+window+1, len(spacy_tokens))
        pre=' '.join([token.text for token in spacy_tokens[start:match]])
        post=' '.join([token.text for token in spacy_tokens[match+1:end]])
        print("%s %s%s%s %s" % (pre.rjust(spacing), RED, spacy_tokens[match].text, RESET, post))

In [11]:
def read_text(filename):
    """ Read a text, replacing all whitespace sequences with a single space.
    """
    with open(filename, encoding="utf-8") as file:
        return re.sub(r"\s+", " ", file.read())

In [12]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/1342_pride_and_prejudice.txt

--2025-11-05 00:40:26--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/1342_pride_and_prejudice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 691804 (676K) [text/plain]
Saving to: ‘1342_pride_and_prejudice.txt’


2025-11-05 00:40:27 (15.1 MB/s) - ‘1342_pride_and_prejudice.txt’ saved [691804/691804]



In [13]:
book = read_text("1342_pride_and_prejudice.txt")

In [14]:
import spacy

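# Disable the components we don't need (named entity recognition and dependency parsing) to
# speed up processing. Each name must be its own list element; the tagger and lemmatizer stay
# active because the search below relies on token.lemma_.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])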

In [15]:
spacy_tokens = nlp(book)

In [16]:
def wordnet_search(synset, spacy_tokens):
    """ This function searches through all of the tokens in the spacy_tokens argument to find
    any mention of words in the synset or any of its hyponyms.
    """
    targets = get_words_in_hypo(synset)
    matches = find_all_words_in_text(targets, spacy_tokens)
    print_concordance(matches, spacy_tokens)
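
Because `wordnet_search` only prints a concordance, it can be hard to see which words dominate the results. A small companion sketch tallies the matched lemmas instead. `count_matched_lemmas` is a hypothetical helper, and the stand-in tokens expose only a `.lemma_` attribute so the example runs without spaCy.

```python
from collections import Counter
from types import SimpleNamespace

def count_matched_lemmas(words, tokens):
    """Count how often each target word occurs (by lemma) among the tokens."""
    return Counter(token.lemma_ for token in tokens if token.lemma_ in words)

# Stand-in tokens (hypothetical data) that mimic spaCy's `.lemma_` attribute.
toy_tokens = [SimpleNamespace(lemma_=lemma) for lemma in
              ["the", "carriage", "and", "the", "chaise", "and", "the", "carriage"]]
counts = count_matched_lemmas({"carriage", "chaise", "coach"}, toy_tokens)
print(counts.most_common())  # [('carriage', 2), ('chaise', 1)]
```

On the real data this could be called as `count_matched_lemmas(get_words_in_hypo(synset), spacy_tokens)`, which makes unexpected high-frequency matches easy to spot.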

**Q1.** Let's do a very coarse tagging of a document to find all of the mentions of a specific WordNet synset and all of its hyponyms. Using the functions above, find all of the color terms in *Pride and Prejudice*.

In [17]:
wordnet_search(wn.synset("color.n.01"), spacy_tokens)

                     he wore a blue coat , and
                    and rode a black horse . An
                   a bottle of wine a day .
                     I liked a red coat myself very
                  given to her complexion , and doubt
              till summoned to coffee . She was
                 walking , the tone of her voice
                   with a fine complexion and good -
                   , but their colour and shape ,
             Nicholls has made white soup enough ,
                        is _ a shade in a character
            reject the offered olive - branch .
                   idea of the olive - branch perhaps
                     come in a scarlet coat , and
                  in any other colour . As for
                 In a softened tone she declared herself
                . Both changed colour , one

**Q2.** Find all of the vehicles mentioned in *Pride and Prejudice*.

In [18]:
wordnet_search(wn.synset("vehicle.n.01"), spacy_tokens)

                   Monday in a chaise and four to
                    not keep a carriage , and had
                     ball in a hack chaise . ”
                     in a hack chaise . ” “
                    I have the carriage ? ” said
                Mr. Bingley 's chaise to go to
                     go in the coach . ” “
                could have the carriage . ” Elizabeth
                  , though the carriage was not to
               offered her the carriage , and she
                  offer of the chaise to an invitation
        afterwards ordered her carriage . Upon this
                     to give a flat denial , and
                in general and ordinary cases between friend
                  beg that the carriage might be sent
             possibly have the carriage before Tuesday ;
                Mr. Bingley 's carriag

**Q3.** Find all of the verbs of speaking in *Pride and Prejudice*.

In [27]:
wordnet_search(wn.synset("speak.v.01"), spacy_tokens)

                    that he is considered the rightful property
                   and four to see the place ,
                    to see the place , and was
                   how can you talk so ! But
                         ” “ I see no occasion for
                 indeed go and see Mr. Bingley when
                       ” “ But consider your daughters .
                  very glad to see you ; and
               take delight in vexing me . You
                   and live to see many young men
                 , he suddenly addressed her with :
             contain herself , began scolding one of
                      “ Do you consider the forms of
                    know , and read great books and
                would not have called on him .
                    over , she began to declare that
                       , as he spoke

**Q4.** Find all of the people in *Pride and Prejudice*.

In [35]:
wn.synsets("had")

[Synset('have.v.01'),
 Synset('have.v.02'),
 Synset('experience.v.03'),
 Synset('own.v.01'),
 Synset('get.v.03'),
 Synset('consume.v.02'),
 Synset('have.v.07'),
 Synset('hold.v.03'),
 Synset('have.v.09'),
 Synset('have.v.10'),
 Synset('have.v.11'),
 Synset('have.v.12'),
 Synset('induce.v.02'),
 Synset('accept.v.02'),
 Synset('receive.v.01'),
 Synset('suffer.v.02'),
 Synset('have.v.17'),
 Synset('give_birth.v.01'),
 Synset('take.v.35')]

In [37]:
for synset in wn.synsets("had"):
  print(synset.definition())

have or possess, either in a concrete or an abstract sense
have as a feature
go through (mental or physical states or experiences)
have ownership or possession of
cause to move; cause to be in a certain position or condition
serve oneself to, or consume regularly
have a personal or business relationship with someone
organize or be responsible for
have left
be confronted with
undergo
suffer from; be ill with
cause to do; cause to act in a specified manner
receive willingly something given or offered
get something; come into possession of
undergo (as of injuries and illnesses)
achieve a point or goal
cause to be born
have sex with; archaic use


In [34]:
wn.synsets("person")

[Synset('person.n.01'), Synset('person.n.02'), Synset('person.n.03')]

In [31]:
wordnet_search(wn.synset("person.n.01"), spacy_tokens)

Streaming output truncated to the last 5000 lines.
                        . “ We have not determined how
               No scheme could have been more agreeable
                       Oh , my dear , dear aunt
                     my dear , dear aunt , ”
                   dear , dear aunt , ” she
                 give me fresh life and vigour .
                What are young men to rocks and
                  young men to rocks and mountains ?
                 be like other travellers , without being
                 know where we have gone -- we
             recollect what we have seen . Lakes
          quarreling about its relative situation . Let
             the generality of travellers . ” Chapter
                     ; for she had seen her sister
                  had seen her sister looking so well
                     , and the [

**Q5.** The methods above all identify *any* mentions of a WordNet synset in a text  -- e.g., every instance of *bank* would be identified as a hit for query bank.n.01 ("sloping land ..."), even if its specific word sense in context was the financial institution (or even a verb).

How might we improve on this method?
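
One possible first step, offered as a sketch rather than a definitive answer, is to keep a match only when the token's part of speech agrees with the part of speech of the query synset. This would drop hits such as the auxiliary verb *have* for a noun query like person.n.01. The mapping table and `find_words_with_pos` are assumptions made for this example, and the stand-in tokens mimic spaCy's `.lemma_` and `.pos_` attributes so the code runs without a model.

```python
from types import SimpleNamespace

# WordNet part-of-speech codes mapped to spaCy's coarse-grained `pos_` tags.
# (Assumption: proper nouns and auxiliaries are deliberately left out.)
WORDNET_TO_SPACY_POS = {
    "n": {"NOUN"},
    "v": {"VERB"},
    "a": {"ADJ"},
    "s": {"ADJ"},   # satellite adjectives
    "r": {"ADV"},
}

def find_words_with_pos(words, tokens, wordnet_pos):
    """Return indexes of tokens whose lemma is in `words` and whose
    part of speech is compatible with the given WordNet POS code."""
    allowed = WORDNET_TO_SPACY_POS[wordnet_pos]
    return [idx for idx, token in enumerate(tokens)
            if token.lemma_ in words and token.pos_ in allowed]

# Stand-in tokens (hypothetical data) for "she had seen her sister".
toy_tokens = [
    SimpleNamespace(lemma_="she", pos_="PRON"),
    SimpleNamespace(lemma_="have", pos_="AUX"),
    SimpleNamespace(lemma_="see", pos_="VERB"),
    SimpleNamespace(lemma_="her", pos_="PRON"),
    SimpleNamespace(lemma_="sister", pos_="NOUN"),
]
noun_matches = find_words_with_pos({"have", "sister"}, toy_tokens, "n")
print(noun_matches)  # [4]
```

With real data the POS code could come from `synset.pos()`. A part-of-speech filter cannot separate senses that share a part of speech (the two noun senses of *bank*, for example); that would need word sense disambiguation, for which NLTK's `nltk.wsd.lesk` might be a simple baseline to try.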