# Retrieving Concepts from Words in NLP

Date: 2023-03-13  
Author: Jason Beach  
Categories: Introduction_Tutorial, Data_Science  
Tags: nlp, word_vector, spacy

<!--eofm-->

<ABSTRACT>
    
* nouns,  NER
* wordnet
* concepts
  - [concise_concepts](https://github.com/Pandora-Intelligence/concise-concepts)
  - ...
* word vectors
  - word2vec
  - vec2word
  - similarity, distribution
  - algebra
  - ...
    
* training entity classifier, [ref](https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy)

## Configure Environment

...

In [81]:
import spacy
nlp = spacy.load("en_core_web_md")    # use _md rather than _sm: the small model does not ship true word embeddings

In [71]:
import concise_concepts

In [70]:
from nltk.corpus import wordnet as wn

In [3]:
import re

In [14]:
import math, random

In [4]:
import nltk
#nltk.download('popular')    # uncomment on first run to download the corpora
nltk.corpus.gutenberg.fileids()[:3]

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']

In [5]:
raw = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
spaces = raw.replace('\n\n',' ').replace('\n',' ').replace('\r',' ')

In [6]:
_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
corpus = _RE_COMBINE_WHITESPACE.sub(" ", spaces).strip()

In [36]:
lcorp = len(corpus)
print(f"corpus' characters: {format(lcorp,',d')}")

rand = 178888   #random.randint(100000, 1000000)
rand_sent = corpus[rand : rand+200]
print(f"random sentence: {rand_sent}")

corpus' characters: 1,211,073
random sentence: But I said nothing, only looking round me sharply. Peleg now threw open a chest, and drawing forth the ship's articles, placed pen and ink before him, and seated himself at a little table. I began to 


## Nouns and Entity Recognition

...

spaCy can recognize the following entity types.  This is more useful for extracting content from text than it is for understanding terms and concepts.

In [8]:
nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

The NER classifier gives us little information here.  The part-of-speech (POS) tags seem more useful.

In [55]:
doc = nlp(rand_sent)

In [69]:
for idx,sent in enumerate(doc.sents):
    if idx<2:
        spacy.displacy.render(sent, style='ent', jupyter=True)
        for term in sent:
            if term.pos_=='NOUN':
                print(f'{term} (__NOUN__)', end=' ')
            else:
                print(f'{term}', end=' ')

But I said nothing , only looking round me sharply . 

Peleg now threw open a chest (__NOUN__) , and drawing forth the ship (__NOUN__) 's articles (__NOUN__) , placed pen (__NOUN__) and ink (__NOUN__) before him , and seated himself at a little table (__NOUN__) . 

## WordNet database

WordNet is a lexical database for the English language that categorizes English words into sets of synonyms.  You can use WordNet alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more.  You can find more information on the [Princeton webpage](https://wordnet.princeton.edu/), which states:

> Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

A thesaurus also groups words by meaning, but WordNet differs in two ways:

* it links specific word senses rather than just word forms, so words that sit close together in the network are semantically disambiguated
* it labels the semantic relationships among words

### Synonym set and concepts

Every Synset has a name made up of a word, a part of speech (noun, verb, adverb, or adjective), and a sense number, such as `chest.n.02`.  A Synset stores synonyms: every word in it shares the same meaning.  Some words belong to just one Synset, while others belong to several, one for each sense.  Every Synset also has a definition, which makes it easier to pick the sense you want.

In NLTK, the `Synset` (synonym set) class represents one such set of synonyms.

Let's find the correct type of chest from those available.  The `wn.synsets()` function returns a list of all the Synsets related to the word passed as the argument.  Once you know its name, a single Synset can be retrieved with the `wn.synset()` function.  It looks like `chest.n.02` and `chest_of_drawers.n.01` are where we should be looking.

In [92]:
wn.synsets('chest', pos=wn.NOUN)

[Synset('thorax.n.02'),
 Synset('chest.n.02'),
 Synset('breast.n.01'),
 Synset('chest_of_drawers.n.01')]

In [132]:
wn.synset('chest.n.02').definition(), wn.synset('chest_of_drawers.n.01').definition()

('box with a lid; used for storage; usually large and sturdy',
 'furniture with drawers for keeping clothes')
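The naming convention described above (word, part of speech, sense number) can be unpacked with plain string handling. The sketch below is only illustrative: `parse_synset_name` is a hypothetical helper, not part of NLTK, and it assumes the three fields are separated by the last two dots in the name.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    lemma: str   # the word, e.g. "chest"
    pos: str     # part-of-speech code, e.g. "n" for noun
    sense: int   # sense number, e.g. 2


def parse_synset_name(name: str) -> SynsetName:
    """Split a name such as 'chest.n.02' into its three fields."""
    # Split from the right so a lemma that itself contains dots stays intact.
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


for name in ["chest.n.02", "chest_of_drawers.n.01"]:
    print(parse_synset_name(name))
```

Splitting from the right is a deliberate choice here, since the part of speech and sense number are always the last two fields.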

Iterate over the noun synsets that are available.  Only the first ten are printed here.

In [115]:
for synset in list(wn.all_synsets('n'))[:10]:
    print(synset)

Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('abstraction.n.06')
Synset('thing.n.12')
Synset('object.n.01')
Synset('whole.n.02')
Synset('congener.n.03')
Synset('living_thing.n.01')
Synset('organism.n.01')
Synset('benthos.n.02')


In [94]:
[str(lemma.name()) for lemma in wn.synset('chest.n.02').lemmas()]

['chest']

### Parent-child relationships

The most frequently encoded relation among synsets is the super-subordinate relation.  The hypernym is the generic parent of the more specific hyponym children in this hierarchical data structure.

* A hypernym is a more generic Synset.  The `hypernyms()` method returns a list of the Synsets directly above the given Synset.
* A hyponym is a more specific Synset; it is the opposite of a hypernym.  The `hyponyms()` method returns a list of the Synsets directly below the given Synset.

All noun hierarchies ultimately go up to the root node {entity}.  The hyponymy relation is transitive: if an armchair is a kind of chair, and if a chair is a kind of furniture, then an armchair is a kind of furniture.

In [98]:
wn.synset('chest.n.02').hypernyms() 

[Synset('box.n.01')]

In [99]:
wn.synset('chest.n.02').hyponyms()

[Synset('caisson.n.03'),
 Synset('cedar_chest.n.01'),
 Synset('coffer.n.02'),
 Synset('hope_chest.n.01'),
 Synset('pyx.n.01'),
 Synset('sea_chest.n.01'),
 Synset('tea_chest.n.01'),
 Synset('toolbox.n.01'),
 Synset('toy_box.n.01'),
 Synset('treasure_chest.n.01')]

In [100]:
wn.synset('chest.n.02').member_holonyms()

[]

In [101]:
wn.synset('chest.n.02').root_hypernyms()

[Synset('entity.n.01')]
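The transitivity of the hyponymy relation can be sketched without WordNet at all. The example below uses a hand-written toy hierarchy (an assumption for illustration; the real WordNet chain has more intermediate levels, and a synset can have more than one hypernym) and hypothetical helper names `hypernym_chain` and `is_a`.

```python
# Toy hypernym map: each word points to its single, more generic parent.
# This is illustrative data, not an extract from WordNet.
TOY_HYPERNYM = {
    "armchair": "chair",
    "chair": "furniture",
    "furniture": "entity",  # intermediate WordNet levels are omitted
}


def hypernym_chain(word):
    """Walk upward from a word and collect every ancestor up to the root."""
    chain = []
    while word in TOY_HYPERNYM:
        word = TOY_HYPERNYM[word]
        chain.append(word)
    return chain


def is_a(word, ancestor):
    """Return True if `ancestor` is reachable by following hypernym links."""
    return ancestor in hypernym_chain(word)


print(hypernym_chain("armchair"))
print(is_a("armchair", "furniture"))
```

Because `is_a` follows every link to the root, an armchair counts as furniture even though the map only stores the direct armchair-to-chair link.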

Adjectives are organized in terms of antonymy. Pairs of “direct” antonyms like wet-dry and young-old reflect the strong semantic contrast of their members.  In NLTK, antonyms are looked up on lemmas rather than on synsets.  Here, it makes sense that chest does not have an opposite, but front does: rear.

In [103]:
wn.synset('chest.n.02').lemmas()[0].antonyms()

[]

In [151]:
wn.synsets('front', pos=wn.NOUN)[0].lemmas()[0].antonyms()

[Lemma('rear.n.02.rear')]

### Part-whole relationships

Meronyms and holonyms express a part-to-whole relationship.

The meronym is the component, whereas the holonym is the whole.  The two terms describe the same relationship from opposite directions.

For example, ‘bedroom’ is a meronym of ‘house’ because a bedroom is a component of a house.  Likewise, the nose, eyes, and mouth are meronyms of the face.

In [138]:
wn.synset('chest_of_drawers.n.01').part_meronyms()

[Synset('drawer.n.01'), Synset('shelf.n.01')]

In [139]:
wn.synset('chest.n.02').part_meronyms()

[Synset('lid.n.02')]

In [140]:
wn.synset('lid.n.02').part_holonyms()

[Synset('box.n.01'), Synset('chest.n.02'), Synset('jar.n.01')]

In [141]:
wn.synset('shelf.n.01').part_holonyms()

[Synset('bookcase.n.01'),
 Synset('buffet.n.01'),
 Synset('cabinet.n.01'),
 Synset('chest_of_drawers.n.01'),
 Synset('closet.n.04'),
 Synset('etagere.n.01'),
 Synset('grocery_store.n.01')]
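Since meronyms and holonyms are two views of the same links, one map can be derived from the other. The sketch below inverts a small part-of map copied from the `part_meronyms()` outputs above; the helper name `invert` is hypothetical, and because the input is only a partial extract, the result lists fewer holonyms than WordNet does.

```python
from collections import defaultdict

# Whole -> parts, copied from the part_meronyms() outputs shown above.
PART_MERONYMS = {
    "chest_of_drawers.n.01": ["drawer.n.01", "shelf.n.01"],
    "chest.n.02": ["lid.n.02"],
}


def invert(meronyms):
    """Turn a whole -> parts map into a part -> wholes map."""
    holonyms = defaultdict(list)
    for whole, parts in meronyms.items():
        for part in parts:
            holonyms[part].append(whole)
    return dict(holonyms)


PART_HOLONYMS = invert(PART_MERONYMS)
print(PART_HOLONYMS["lid.n.02"])
print(PART_HOLONYMS["shelf.n.01"])
```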

An entailment is a relation between verbs in which one action necessarily involves another, even though that is not explicitly stated.  For example, eating entails chewing and swallowing.

In [144]:
wn.synset('eat.v.01').entailments()

[Synset('chew.v.01'), Synset('swallow.v.01')]

### Lemmas and similarity

Lemmas are all the words that are in a `Synset`.  The `lemma_names()` method returns the names of all lemmas in the specified `Synset`.  You can reach the Synset in two ways: by indexing into the list returned by `wn.synsets()`, or by naming it directly with `wn.synset()`.

In [121]:
wn.synsets('chest')[1].lemma_names(), wn.synsets('chest')[3].lemma_names()

(['chest'], ['chest_of_drawers', 'chest', 'bureau', 'dresser'])

In [131]:
wn.synset('chest.n.02').lemma_names(), wn.synset('chest_of_drawers.n.01').lemma_names()

(['chest'], ['chest_of_drawers', 'chest', 'bureau', 'dresser'])

Lemmas can also have relations between them.  The three relations below (derivationally related forms, pertainyms, and antonyms) exist only on lemmas, not on synsets.

In [104]:
wn.lemma('vocal.a.01.vocal').derivationally_related_forms()

[Lemma('vocalize.v.02.vocalize')]

In [105]:
wn.lemma('vocal.a.01.vocal').pertainyms()

[Lemma('voice.n.02.voice')]

In [106]:
wn.lemma('vocal.a.01.vocal').antonyms()

[Lemma('instrumental.a.01.instrumental')]

Several similarity measures are available.  Path similarity returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.  The score is in the range 0 to 1.  Judging by the scores below, it does not appear to be a good representation of similarity.

The Leacock-Chodorow similarity returns a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.  The relationship is given as -log(p/2d), where p is the shortest path length and d is the taxonomy depth.  Unlike path similarity, this score is not limited to the range 0 to 1.

In [111]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
hug = wn.synset('hug.v.01')

In [112]:
hit.path_similarity(slap)

0.14285714285714285

In [113]:
hit.path_similarity(hug)

0.125

In [110]:
hit.lch_similarity(slap)

1.3121863889661687

In [114]:
hit.lch_similarity(hug)

1.1786549963416462
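To make the two formulas concrete, the sketch below recomputes the scores printed above from path lengths alone. It rests on assumptions that are not stated in the outputs: that path similarity is computed as 1 / (distance + 1), that the Leacock-Chodorow `p` counts nodes (distance + 1), and that the edge distances (6 and 7) and the verb taxonomy depth (13) are the values that reproduce the printed numbers. The function names are hypothetical stand-ins, not the NLTK API.

```python
import math


def path_score(edge_distance):
    """Path similarity, assuming the form 1 / (shortest path length + 1)."""
    return 1.0 / (edge_distance + 1)


def lch_score(edge_distance, taxonomy_depth):
    """Leacock-Chodorow similarity: -log(p / 2d), with p = edge_distance + 1."""
    return -math.log((edge_distance + 1) / (2.0 * taxonomy_depth))


# Assumed values, inferred by working backward from the printed scores.
HIT_SLAP_DISTANCE = 6
HIT_HUG_DISTANCE = 7
VERB_TAXONOMY_DEPTH = 13

print(path_score(HIT_SLAP_DISTANCE), path_score(HIT_HUG_DISTANCE))
print(lch_score(HIT_SLAP_DISTANCE, VERB_TAXONOMY_DEPTH),
      lch_score(HIT_HUG_DISTANCE, VERB_TAXONOMY_DEPTH))
```

Under these assumptions, hit/slap and hit/hug differ by a single step in the taxonomy, which would explain why neither measure separates the two pairs well.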

## Few-shot classifier

When you want to apply NER to concise concepts, it is easy to come up with examples, but it takes some effort to train an entire pipeline.  Concise Concepts uses few-shot NER based on word embedding similarity to get you going with ease!

In [82]:
data = {
    "furniture": ["locker","table","chair","cabinet"],
    "document": ["papers","documents","forms"],
    "tools": ["pencil","pen","hammer","nail","screwdriver"]
}
options = {"colors": {"furniture": "darkorange", "document": "beige", "tools": "tan"},    # "lightbrown" is not a CSS color name
           "ents": ["furniture", "document", "tools"]}

nlp.add_pipe("concise_concepts", 
    config={"data": data}
)

<concise_concepts.conceptualizer.Conceptualizer.Conceptualizer at 0x7f1d037e65c0>

In [83]:
doc = nlp(rand_sent)
spacy.displacy.render(doc, style="ent", options=options)
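The idea of labeling a word by its embedding similarity to a few seed examples can be sketched with plain Python. This is not the `concise_concepts` implementation: the three-dimensional vectors are invented for illustration (real models use hundreds of dimensions), and `cosine`, `best_concept`, and the `threshold` parameter are hypothetical names.

```python
import math

# Invented toy "embeddings"; these are NOT values from any real model.
TOY_VECTORS = {
    "table": (0.9, 0.1, 0.0),
    "chair": (0.8, 0.2, 0.1),
    "pen": (0.1, 0.9, 0.1),
    "hammer": (0.2, 0.8, 0.3),
    "chest": (0.85, 0.2, 0.05),
    "ink": (0.15, 0.85, 0.2),
}

# A few seed words per concept, in the spirit of the `data` dictionary above.
CONCEPTS = {"furniture": ["table", "chair"], "tools": ["pen", "hammer"]}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_concept(word, threshold=0.9):
    """Return the concept whose closest seed is most similar, or None."""
    scores = {
        concept: max(cosine(TOY_VECTORS[word], TOY_VECTORS[seed]) for seed in seeds)
        for concept, seeds in CONCEPTS.items()
    }
    concept = max(scores, key=scores.get)
    return concept if scores[concept] >= threshold else None


print(best_concept("chest"))
print(best_concept("ink"))
```

The threshold keeps words that are not close to any seed from being forced into a concept, which is one plausible way to trade recall for precision.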

In [89]:
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_}"# ({ent._.ent_score:.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

spacy.displacy.render(doc, style="ent", options=options)

In [87]:
dir(ents[0]._)

['get',
 'has',
 'set',
 'spaczz_counts',
 'spaczz_details',
 'spaczz_ent',
 'spaczz_ratio',
 'spaczz_span',
 'spaczz_type',
 'spaczz_types']

## Word Vectors

...