WordNet KG

An experimental Python pipeline that converts the NLTK WordNet corpus into a knowledge graph (KG), with additional synonym pairs from dictionaries.

This project generates a graph of synonyms from the WordNet and MWAPPDB datasets. The graph is saved in two files: wordnet_{wn.get_version()}_synonims.tsv.gz and wordnet_{wn.get_version()}_words.tsv.gz, where wn.get_version() is the version of the WordNet dataset being used. The first file contains the edges of the graph; the second contains the nodes and their categories (determined by the NLTK tagset). The repository ships the WordNet 3.0 outputs as wordnet_3.0_synonims.tsv.tar.gz and wordnet_3.0_words.tsv.tar.gz.
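As a rough illustration of the two-file layout described above, the sketch below writes and reads a gzip-compressed TSV edge list and node list using only the Python standard library. The column names (source, destination, word, category) and the helper names are assumptions made for this example; the columns produced by the notebook may differ.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, header, rows):
    """Write rows to a gzip-compressed, tab-separated file with a header line."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file into a list of dictionaries."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def output_paths(directory, version):
    """Build the two file names following the naming pattern in this README."""
    edges = os.path.join(directory, f"wordnet_{version}_synonims.tsv.gz")
    nodes = os.path.join(directory, f"wordnet_{version}_words.tsv.gz")
    return edges, nodes


def demo():
    """Round-trip a toy graph through the two files (toy data, not WordNet)."""
    with tempfile.TemporaryDirectory() as directory:
        edges_path, nodes_path = output_paths(directory, "3.0")
        # Hypothetical column names, chosen only for this example.
        write_tsv_gz(edges_path, ["source", "destination"], [("car", "automobile")])
        write_tsv_gz(
            nodes_path,
            ["word", "category"],
            [("car", "NOUN"), ("automobile", "NOUN")],
        )
        return read_tsv_gz(edges_path), read_tsv_gz(nodes_path)
```

Calling demo() returns the toy edge and node records as lists of dictionaries, one per row.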

Status of the project

The original goal of this project was to create a graph of synonyms that could be used to expand terms in textual corpora, for later use in combination with BERT and BM25. BERT is a deep learning language model used for natural language processing tasks such as text classification and question answering, while BM25 is a ranking function used in information retrieval to rank documents by their relevance to a given query. Expanding a term with its synonyms may improve the performance of BERT and BM25 on tasks such as text classification and information retrieval.
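To make the idea of term expansion concrete, here is a minimal sketch that appends direct synonyms to a list of tokens. The function names and the toy edges are hypothetical and are not taken from the notebook; a real setup would load the edges from the synonym file.

```python
from collections import defaultdict


def build_synonym_index(edges):
    """Build an undirected adjacency map from (word, synonym) pairs."""
    index = defaultdict(set)
    for left, right in edges:
        index[left].add(right)
        index[right].add(left)
    return index


def expand_terms(tokens, index):
    """Return the original tokens followed by any direct synonyms not already present."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        # Sort the synonyms so the output order is deterministic.
        for synonym in sorted(index.get(token, ())):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
    return expanded
```

A term that has no entry in the graph is returned unchanged, so the expansion only helps for words the graph actually covers.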

However, to the author's current understanding, the WordNet graph does not appear usable for term expansion, because WordNet lacks synonyms for rare terms. These rare terms are exactly the ones that most often need expansion, as they are poorly represented in the dataset. The Hugging Face tokenizer, which is commonly used to preprocess text for BERT, may drop these very rare terms while retaining their more common synonyms, yet those synonyms may not be available in the WordNet graph. As a result, it may not be possible to use the WordNet graph for term expansion as originally intended.

Current pipeline

The current pipeline for generating the WordNet graph is provided in the Jupyter notebook included in this repository (Building a WordNet graph.ipynb). The notebook contains the code and instructions needed to download the required datasets and generate the graph; run it to reproduce the results. For questions or issues with the code, refer to the comments in the notebook or contact the repository maintainer.
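The notebook remains the reference for the actual pipeline. As a minimal sketch of the core idea, assuming that synonym edges link every pair of lemmas sharing a WordNet synset (an assumption; the notebook may build edges differently), the function below works on plain lists of lemma names so that it runs without downloading any corpus. The lowercasing and underscore replacement are also assumptions made for this example. With NLTK and the WordNet corpus installed, the input could be produced with synset.lemma_names() for each synset in wn.all_synsets().

```python
from itertools import combinations


def synonym_edges(synsets):
    """Return sorted, de-duplicated undirected edges between lemmas sharing a synset."""
    edges = set()
    for lemmas in synsets:
        # Normalise lemma names; WordNet uses underscores in multi-word lemmas.
        words = sorted({lemma.lower().replace("_", " ") for lemma in lemmas})
        # Link every unordered pair of distinct words within the synset.
        # Because the words are sorted, each pair has a single canonical form.
        edges.update(combinations(words, 2))
    return sorted(edges)
```

A synset with a single lemma yields no edges under this scheme, so such a word would appear in the graph only as an isolated node.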

Current state of the graph

The graph generated by this project is an undirected graph of synonyms containing 251.39K nodes and 413.71K edges. It has 68.31K connected components, the largest containing 80.75K nodes and the smallest a single node. Node degrees range from 0 to 257, with a mode of 1, a mean of 3.29, and a median of 2. The nodes with the highest degree are high-frequency English verbs such as "take," "get," "pass," "break," and "hold."
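The reported mean degree agrees with the node and edge counts: in an undirected graph the mean degree is 2E/N, and 2 × 413.71K / 251.39K ≈ 3.29. The sketch below shows one way such degree statistics and connected components could be computed from an edge list with the standard library alone. It is an illustration, not the code used to produce the figures above, and it assumes that every edge endpoint also appears in the node list.

```python
from collections import Counter
from statistics import mean, median, mode


def degree_summary(nodes, edges):
    """Compute min, max, mean, median and mode degree of an undirected graph."""
    # Start every node at degree 0 so isolated nodes are counted too.
    degree = Counter({node: 0 for node in nodes})
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
    values = list(degree.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


def connected_components(nodes, edges):
    """Return the connected components as a list of node sets (union-find)."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, compressing the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        # Merge the two components joined by this edge.
        parent[find(left)] = find(right)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```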

The node types in this graph are the word tags assigned to each node, which represent the part of speech of the word. The tags come from the NLTK tagset, a standard set of tags used to annotate the part of speech of words in a corpus. Common tags include "NOUN" for nouns, "VERB" for verbs, "ADJ" for adjectives, and "ADV" for adverbs.

There are 27 different node types in the graph, the most common being "UNKNOWN," "NOUN," and "ADJ." There are also 11 nodes with unknown node types and 25.53K singleton nodes. The graph further contains 21.60K node tuples, 2.92K node triplets, and 1.46K node quadruplets. The reported diameter is 3, the radius is 2, and the average shortest path length is 2.36. The graph is sparse, with a reported density of 0.002.
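For reference, the sketch below computes the standard density of a simple undirected graph, 2E / (N(N - 1)), and counts small connected components by size. It assumes that "node tuples," "node triplets," and "node quadruplets" denote connected components of two, three, and four nodes, which this README does not state explicitly. With the rounded node and edge counts given earlier, the standard formula evaluates to roughly 1.3e-5, so the 0.002 figure may follow a different definition or normalisation.

```python
from collections import Counter


def density(node_count, edge_count):
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    if node_count < 2:
        return 0.0
    return 2 * edge_count / (node_count * (node_count - 1))


def small_component_counts(component_sizes):
    """Count components of size 1 to 4 given an iterable of component sizes."""
    sizes = Counter(component_sizes)
    # Labels are an assumed reading of the terms used in this README.
    labels = {1: "singletons", 2: "tuples", 3: "triplets", 4: "quadruplets"}
    return {label: sizes[size] for size, label in labels.items()}
```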

There are several topological oddities in the graph, including singleton nodes, node tuples, node triplets, and node quadruplets.

LucaCappelletti94/wordnet_knowledge_graph
