A Python pipeline converting the NLTK wordnet into a KG, with additional synonym pairs from dictionaries.

LucaCappelletti94/wordnet_knowledge_graph

WordNet KG

An experimental Python pipeline converting the NLTK wordnet into a KG, with additional synonym pairs from dictionaries.

This project generates a graph of synonyms using the WordNet and MWAPPDB datasets. The graph is saved in two files: wordnet_{wn.get_version()}_synonims.tsv.gz and wordnet_{wn.get_version()}_words.tsv.gz, where wn.get_version() is the version of the WordNet dataset being used. The first file contains the edges of the graph, and the second file contains the nodes and their categories (determined by the nltk tagset).
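As a rough illustration of how an edge list and a node list of this kind could be produced, the sketch below turns groups of synonymous lemmas into undirected edges and writes the two gzipped TSV files. It is not the notebook's code: the helper names (`synonym_edges`, `write_graph`, `wordnet_lemma_groups`), the column headers, and the placeholder node category are assumptions made for illustration, and the MWAPPDB pairs are left out.

```python
import csv
import gzip
from itertools import combinations
from pathlib import Path


def synonym_edges(lemma_groups):
    """Return the sorted set of undirected synonym pairs from groups of lemmas."""
    edges = set()
    for group in lemma_groups:
        # Every pair of distinct lemmas in a group is treated as a synonym pair.
        for left, right in combinations(sorted(set(group)), 2):
            edges.add((left, right))
    return sorted(edges)


def write_graph(lemma_groups, version, out_dir="."):
    """Write the edge and node files; the column names are hypothetical."""
    lemma_groups = [list(group) for group in lemma_groups]
    out_dir = Path(out_dir)
    edges_path = out_dir / f"wordnet_{version}_synonims.tsv.gz"
    nodes_path = out_dir / f"wordnet_{version}_words.tsv.gz"
    with gzip.open(edges_path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["source", "destination"])
        writer.writerows(synonym_edges(lemma_groups))
    nodes = sorted({lemma for group in lemma_groups for lemma in group})
    with gzip.open(nodes_path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["word", "category"])
        # Placeholder category: the real pipeline assigns NLTK tags here.
        writer.writerows((node, "UNKNOWN") for node in nodes)
    return edges_path, nodes_path


def wordnet_lemma_groups():
    """Yield the lemma names of each synset (needs nltk and its 'wordnet' corpus)."""
    from nltk.corpus import wordnet as wn

    for synset in wn.all_synsets():
        yield synset.lemma_names()
```

With NLTK and its WordNet corpus installed, calling `write_graph(wordnet_lemma_groups(), wn.get_version())` would produce files named after the installed WordNet version. Treating every pair of lemmas in a synset as synonyms is one possible design choice, not necessarily the one used in the notebook.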

Status of the project

The original goal of this project was to build a graph of synonyms for expanding terms in textual corpora, to be used afterwards in combination with BERT and BM25. BERT is a deep learning language model used for natural language processing tasks such as text classification and question answering, while BM25 is a ranking function used in information retrieval to score documents by their relevance to a given query. Expanding a term with its synonyms may improve the performance of BERT and BM25 on tasks such as text classification and information retrieval.
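The term expansion idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: `expand_terms` and the toy `synonyms` mapping are hypothetical names, and the mapping stands in for an adjacency list built from the synonym edges.

```python
def expand_terms(tokens, synonyms):
    """Return the tokens followed by any synonyms not already present, in order."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        for synonym in synonyms.get(token, ()):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
    return expanded


# Hypothetical adjacency built from an edge list of synonym pairs.
synonyms = {"car": ["auto", "automobile"], "fast": ["quick"]}
print(expand_terms(["fast", "car", "rental"], synonyms))
# → ['fast', 'car', 'rental', 'quick', 'auto', 'automobile']
```

A term with no entry in the mapping, such as "rental" here, is passed through unchanged.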

However, to the author's current understanding, the WordNet graph does not seem usable for term expansion, because WordNet lacks synonyms for rare terms. These rare terms are exactly the ones that most often require expansion, as they are not well-represented in the dataset. The Hugging Face tokenizer, which is commonly used to preprocess text for BERT, may drop these very rare terms while keeping their more common synonyms, but those synonyms may not be available in the WordNet graph. As a result, it may not be possible to use the WordNet graph for term expansion as originally intended.

Current pipeline

The current pipeline for generating the WordNet graph is provided in the Jupyter notebook included in this repository. The notebook contains all of the necessary code and instructions for downloading the required datasets and generating the graph. Simply run the code in the notebook to reproduce the results. If you have any questions or issues with the code, you can refer to the comments in the notebook or contact the repository maintainer for assistance.

Current state of the graph

The graph generated by this project is an undirected graph of synonyms containing 251.39K nodes and 413.71K edges. It has 68.31K connected components, the largest containing 80.75K nodes and the smallest a single node. Node degree ranges from 0 to 257, with a mode of 1, a mean of 3.29, and a median of 2. The nodes with the highest degree are high-frequency English verbs such as "take," "get," "pass," "break," and "hold."
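For an undirected graph the mean degree is twice the number of edges divided by the number of nodes, so the reported counts give 2 × 413.71K / 251.39K ≈ 3.29, in line with the mean above. The sketch below shows how such degree statistics could be computed from a node list and an edge list using only the standard library. The function name `degree_stats` and the toy graph are assumptions for illustration, not the tooling used to produce the figures in this README.

```python
from collections import Counter
from statistics import mean, median, mode


def degree_stats(nodes, edges):
    """Compute simple degree statistics for an undirected graph."""
    degree = Counter({node: 0 for node in nodes})  # isolated nodes keep degree 0
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
    values = list(degree.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


# Toy graph, not the WordNet graph.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d")]
stats = degree_stats(nodes, edges)
print(stats)
```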

The node types in this graph refer to the word tags assigned to each node, which represent the part of speech of the word. The word tags are determined by the nltk tagset, which is a standard set of tags used to annotate the part of speech of words in a corpus. Some common word tags include "NOUN" for nouns, "VERB" for verbs, "ADJ" for adjectives, and "ADV" for adverbs.

There are 27 different node types in the graph, the most common being "UNKNOWN," "NOUN," and "ADJ." There are also 11 nodes with unknown node types and 25.53K singleton nodes. The graph further contains 21.60K node tuples, 2.92K node triplets, and 1.46K node quadruplets. The graph has a diameter of 3 and a radius of 2, and the average shortest path length is 2.36. With a density of 0.002, the graph is sparse.

There are several topological oddities in the graph, including singleton nodes, node tuples, node triplets, and node quadruplets.
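One way to count such small structures is to compute the connected components and tally them by size. The sketch below does this with a union-find structure, assuming that singletons, tuples, triplets, and quadruplets denote connected components of one, two, three, and four nodes respectively. That reading, along with the function names and the toy graph, is an assumption and may not match how the figures above were obtained.

```python
from collections import Counter


def component_sizes(nodes, edges):
    """Return the sizes of the connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right
    return list(Counter(find(node) for node in nodes).values())


def count_oddities(nodes, edges):
    """Count components of 1 to 4 nodes; the naming is an assumption."""
    names = {1: "singletons", 2: "tuples", 3: "triplets", 4: "quadruplets"}
    histogram = Counter(component_sizes(nodes, edges))
    return {name: histogram.get(size, 0) for size, name in names.items()}


# Toy graph, not the WordNet graph: one component of each size from 1 to 4.
nodes = list("abcdefghij")
edges = [("a", "b"), ("c", "d"), ("d", "e"), ("f", "g"), ("g", "h"), ("h", "i")]
print(count_oddities(nodes, edges))
```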
