To use this project, first download the datasets the code relies on by running the cell below in a Jupyter notebook.

Running it opens a window where you can select the datasets to download. Make sure to select the `wordnet`, `treebank`, `brown`, `nps_chat`, `conll2000`, `dependency_treebank`, `masc_tagged`, `multext_east`, `switchboard`, and `timit_tagged` datasets. The synonym-dictionary cell further down also loads the `mwa_ppdb` package.

In [151]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
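
If the interactive window is inconvenient (for example on a headless machine), the packages can also be requested by identifier. The following is a minimal sketch, not part of the original notebook: `REQUIRED_PACKAGES` and `download_required` are names introduced here, and the identifiers are assumptions to verify against the NLTK data index. In particular, `multext_east` and `timit_tagged` are, as far as I know, distributed under the identifiers `mte_teip5` and `timit`.

```python
# Package identifiers are assumptions; check them against the NLTK data index.
REQUIRED_PACKAGES = [
    "wordnet",
    "treebank",
    "brown",
    "nps_chat",
    "conll2000",
    "dependency_treebank",
    "masc_tagged",
    "mte_teip5",    # assumed identifier behind nltk.corpus.multext_east
    "switchboard",
    "timit",        # assumed identifier behind nltk.corpus.timit_tagged
    "mwa_ppdb",     # used by the secondary synonym dictionary
]


def download_required(packages=REQUIRED_PACKAGES):
    """Download each package quietly and return {identifier: success flag}."""
    import nltk  # imported lazily so the list can be inspected without NLTK

    return {name: nltk.download(name, quiet=True) for name in packages}
```

Calling `download_required()` returns a dictionary of success flags, so any identifier that maps to `False` is worth checking by hand in the interactive downloader.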

Getting tag datasets

In [10]:
import nltk

# Merge word -> tag pairs from every corpus into a single lookup table.
# Later corpora override earlier ones, and within a corpus the last tag
# seen for a word wins.
word_universal_tags = {
    **dict(nltk.corpus.treebank_chunk.tagged_words(tagset="universal")),
    **dict(nltk.corpus.treebank.tagged_words(tagset="universal")),
    **dict(nltk.corpus.brown.tagged_words(tagset="universal")),
    **dict(nltk.corpus.nps_chat.tagged_words(tagset="universal")),
    **dict(nltk.corpus.conll2000.tagged_words(tagset="universal")),
    # No tagset argument here, so these tags stay in the corpus's native tagset.
    **dict(nltk.corpus.dependency_treebank.tagged_words()),
    **dict(nltk.corpus.masc_tagged.tagged_words(tagset="universal")),
    **dict(nltk.corpus.multext_east.tagged_words(tagset="universal")),
    **dict(nltk.corpus.switchboard.tagged_words(tagset="universal")),
    **dict(nltk.corpus.timit_tagged.tagged_words(tagset="universal")),
}
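
Because the corpora are merged with dictionary unpacking, a word that occurs with several tags keeps only the last tag seen. The toy example below uses made-up words and tags (not taken from the corpora) to illustrate this, and shows one possible alternative that keeps the most frequent tag instead. `most_frequent_tags` is a hypothetical helper, not something the notebook uses.

```python
from collections import Counter, defaultdict

# Made-up (word, tag) pairs standing in for two tagged corpora.
corpus_a = [("run", "VERB"), ("run", "VERB"), ("run", "NOUN")]
corpus_b = [("run", "ADJ"), ("fast", "ADV")]

# Same merging style as the notebook: the last tag seen for a word wins.
merged = {**dict(corpus_a), **dict(corpus_b)}
print(merged)  # {'run': 'ADJ', 'fast': 'ADV'}


def most_frequent_tags(*corpora):
    """Return, for each word, the tag it carries most often across corpora."""
    counts = defaultdict(Counter)
    for corpus in corpora:
        for word, tag in corpus:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


by_frequency = most_frequent_tags(corpus_a, corpus_b)
print(by_frequency)  # {'run': 'VERB', 'fast': 'ADV'}
```

Whether the extra pass is worthwhile depends on how much the node categories matter downstream; the last-wins merge is simpler and needs only one dictionary.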

We also load a secondary synonym dictionary, in addition to WordNet.

In [11]:
from nltk.corpus import MWAPPDBCorpusReader
from nltk.corpus.util import LazyCorpusLoader
 
mwa_ppdb = LazyCorpusLoader(
    "mwa_ppdb",
    MWAPPDBCorpusReader,
    r"(?!README|\.).*",
    nltk_data_subdir="misc",
    encoding="utf8",
)
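
The graph-building cell later converts the paraphrase pairs into a plain dictionary. Assuming `mwa_ppdb.entries()` yields `(source, target)` pairs, a source word that occurs in several pairs keeps only its last target after that conversion. If every paraphrase should be kept, the pairs could be grouped instead. The sketch below uses made-up pairs, and `group_pairs` is a hypothetical helper.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group (source, target) pairs into {source: [targets]} without duplicates."""
    grouped = defaultdict(list)
    for source, target in pairs:
        if target not in grouped[source]:
            grouped[source].append(target)
    return dict(grouped)


# Made-up pairs; the real ones would come from the corpus reader.
pairs = [("big", "large"), ("big", "huge"), ("tiny", "small"), ("tiny", "small")]

as_dict = dict(pairs)
print(as_dict)  # {'big': 'huge', 'tiny': 'small'}

as_groups = group_pairs(pairs)
print(as_groups)  # {'big': ['large', 'huge'], 'tiny': ['small']}
```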

We now build a first version of the graph.

The graph is saved in two files: `wordnet_{wn.get_version()}_synonims.tsv` and `wordnet_{wn.get_version()}_words.tsv`, where `wn.get_version()` is the version of the WordNet dataset you are using. The first file contains the edges of the graph; the second contains the nodes and their categories (the tags collected from the NLTK tagged corpora above, or `UNKNOWN` when no tag is found).
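
To make the two file layouts concrete, here is a small self-contained sketch that writes a toy graph in the same tab-separated format. Only the column headers and the `UNKNOWN` label come from the notebook; the words, the categories and the `write_graph` helper are illustrative assumptions, and the category fallback is simplified compared with the notebook's logic.

```python
import csv
import io


def write_graph(edge_file, node_file, edges, categories):
    """Write an edge list and a node list as tab-separated text."""
    edge_writer = csv.writer(edge_file, delimiter="\t", lineterminator="\n")
    node_writer = csv.writer(node_file, delimiter="\t", lineterminator="\n")
    edge_writer.writerow(["source", "destination"])
    node_writer.writerow(["node_name", "category"])
    edge_writer.writerows(edges)

    seen = set()  # each node is written once, in order of first appearance
    for edge in edges:
        for word in edge:
            if word not in seen:
                seen.add(word)
                node_writer.writerow([word, categories.get(word, "UNKNOWN")])


# Toy data; in-memory buffers stand in for the two .tsv files.
edges = [("big", "large"), ("big", "huge")]
categories = {"big": "ADJ", "large": "ADJ"}
edge_buffer, node_buffer = io.StringIO(), io.StringIO()
write_graph(edge_buffer, node_buffer, edges, categories)

print(edge_buffer.getvalue())
print(node_buffer.getvalue())
```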

In [15]:
from typing import List
from tqdm.auto import tqdm
from nltk.corpus import wordnet as wn

# Terms already written to the node file, so each node is emitted once.
set_of_nodes = set()

# Secondary synonym pairs; as a dict, each source word keeps a single target.
entries = dict(mwa_ppdb.entries())

# Every word that appears on either side of a secondary synonym pair.
other_words = [
    e
    for s, d in entries.items()
    for e in (s, d)
]

# The context manager closes both files even if the loop raises.
with open(f"wordnet_{wn.get_version()}_synonims.tsv", "w") as edges, \
        open(f"wordnet_{wn.get_version()}_words.tsv", "w") as nodes:
    edges.write("source\tdestination\n")
    nodes.write("node_name\tcategory\n")

    for word in tqdm(
        list(wn.words()) + other_words,
        desc="Words",
        leave=False,
        dynamic_ncols=True,
    ):
        if len(word.strip()) == 0:
            continue

        # wn.synonyms returns one list per synset; flatten them.
        synonims = [
            s
            for syns in wn.synonyms(word)
            for s in syns
        ]
        if word in entries:
            synonims.append(entries[word])

        terms_to_add_to_node_list = [word]
        for synonim in synonims:
            if len(synonim.strip()) == 0:
                continue
            terms_to_add_to_node_list.append(synonim)
            edges.write(f"{word}\t{synonim}\n")

        # Group category: the tag of the first tagged term, else "UNKNOWN".
        category = None
        for term in terms_to_add_to_node_list:
            if term in word_universal_tags:
                category = word_universal_tags[term]
                break
        if category is None:
            category = "UNKNOWN"

        # Write each new term with its own tag, falling back to the group category.
        for term in terms_to_add_to_node_list:
            if term in set_of_nodes:
                continue
            if len(term.strip()) == 0:
                continue
            this_term_category = word_universal_tags.get(term, category)
            nodes.write(f"{term}\t{this_term_category}\n")
            set_of_nodes.add(term)

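
The category assignment in the loop above can be read as a two-step rule: the group (a word plus its synonyms) takes the tag of its first tagged member, and each term then uses its own tag if it has one, falling back to the group tag otherwise. Below is a minimal restatement with made-up tags. `assign_categories` is a name introduced here, and the check for nodes that were already written is left out for brevity.

```python
def assign_categories(terms, tags, default="UNKNOWN"):
    """Map each term to its own tag, else to the first tag found in the group."""
    group_category = next((tags[t] for t in terms if t in tags), default)
    return {term: tags.get(term, group_category) for term in terms}


# Made-up tags: "big" is untagged and inherits the group's first tag.
tags = {"large": "ADJ", "bulk": "NOUN"}

inherited = assign_categories(["big", "large", "bulk"], tags)
print(inherited)  # {'big': 'ADJ', 'large': 'ADJ', 'bulk': 'NOUN'}

untagged = assign_categories(["zzz"], tags)
print(untagged)  # {'zzz': 'UNKNOWN'}
```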

We then load the graph from the two files.

In [16]:
from grape import Graph

graph = Graph.from_csv(
    node_path=f"wordnet_{wn.get_version()}_words.tsv",
    nodes_column="node_name",
    node_list_node_types_column="category",
    edge_path=f"wordnet_{wn.get_version()}_synonims.tsv",
    directed=False,
    name="WordNet Synonims"
)
graph.enable()
graph
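
As a quick sanity check on the two files, one can verify that every edge endpoint also appears in the node list. The sketch below uses only the standard library and in-memory toy data. `missing_endpoints` is a hypothetical helper, and with the real files one would pass open file handles instead of `io.StringIO` objects.

```python
import csv
import io


def missing_endpoints(node_file, edge_file):
    """Return the edge endpoints that do not appear in the node list."""
    node_rows = csv.DictReader(node_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    nodes = {row["node_name"] for row in node_rows}

    missing = set()
    edge_rows = csv.DictReader(edge_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in edge_rows:
        for column in ("source", "destination"):
            if row[column] not in nodes:
                missing.add(row[column])
    return missing


# Toy data: "huge" is used in an edge but never declared as a node.
node_data = io.StringIO("node_name\tcategory\nbig\tADJ\nlarge\tADJ\n")
edge_data = io.StringIO("source\tdestination\nbig\tlarge\nbig\thuge\n")

missing = missing_endpoints(node_data, edge_data)
print(missing)  # {'huge'}
```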

In [17]:
str(graph)

'<div class="graph-report"><style>.graph-report li {margin: 0.5em 0 0.5em 0;}.graph-report .paragraph {text-align: justify;word-break: break-all;}.graph-report .small-columns {column-count: 4;column-gap: 2em;}.graph-report .medium-columns {column-count: 3;column-gap: 2em;}.graph-report .large-columns {column-count: 2;column-gap: 2em;}.graph-report .single-column {}@media only screen and (max-width: 600px) {.graph-report .small-columns {column-count: 1;}.graph-report .medium-columns {column-count: 1;}.graph-report .large-columns {column-count: 1;}}@media only screen and (min-width: 600px) and (max-width: 800px) {.graph-report .small-columns {column-count: 2;}.graph-report .medium-columns {column-count: 1;}.graph-report .large-columns {column-count: 1;}}@media only screen and (min-width: 800px) and (max-width: 1024px) {.graph-report .small-columns {column-count: 3;}.graph-report .medium-columns {column-count: 2;}.graph-report .large-columns {column-count: 1;}}</style><h2>WordNet Synonims