This repository has been archived by the owner on Apr 23, 2020. It is now read-only.

SemWiktionary

A Java API for local access to data from Wiktionary, a collaboratively edited, free dictionary. It specifically targets the French Wiktionary.

Goal

Offer a complete graph database, together with a Java API to access it, providing the following information about a word:

  1. Definitions.
  2. Semantic relations with other words, such as synonyms, antonyms, paronyms…
  3. Lexical class (or “part of speech”). — not available yet

Example

import edu.unice.polytech.kis.semwiktionary.model.Word;


Word hello = Word.find("bonjour");	// database lookup

for (Word salutation : hello.getSynonyms()) {
	System.out.println(salutation + " world!");	// all variants of “hello world!”…
	
	System.out.print("Most usually used in the context of: "); // …with the domain (usage context, e.g. “sociology”)…
	System.out.println(salutation.getDefinitions().get(0).getDomain()); // …of their most common meaning (looked up on each synonym, not on “bonjour”)
}

How to use

Remember that we are currently offering support only for the French Wiktionary. This software has not been tested with any other language. You are most welcome to try and contribute support for other languages, though!

Acquiring the API

Download the latest build from the downloads page.

All necessary dependencies are in the lib folder, and the API itself is available as a JAR at the archive's root.

Acquiring an already parsed database

This is clearly the preferred method, as it will allow you to skip the long task of parsing the Wiktionary yourself. As long as our servers can handle the load, you can download the full French Wiktionary database (80 MB ZIP).

You will then have to move the contents of the archive into a data folder inside the extracted API archive, so as to obtain the following file hierarchy:

┲SemWiktionary (extracted API archive)
├  SemWiktionary.jar
├  wiktionary
├┬ lib
 ┋ (…many jars…)
├┬ data (extracted database archive)
 ┋ (…many "neostore" files…)
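
As a quick sanity check of this hierarchy, the following sketch verifies that the JAR, the lib folder and the data folder are all present. It is not part of SemWiktionary: the `LayoutCheck` class and its `missingEntries` method are hypothetical names used for illustration, and only the file and folder names come from the listing above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with SemWiktionary.
final class LayoutCheck {

    /** Returns the names of the expected entries that are absent under root. */
    static List<String> missingEntries(Path root) {
        List<String> missing = new ArrayList<>();
        if (!Files.isRegularFile(root.resolve("SemWiktionary.jar"))) {
            missing.add("SemWiktionary.jar");
        }
        if (!Files.isDirectory(root.resolve("lib"))) {
            missing.add("lib");
        }
        if (!Files.isDirectory(root.resolve("data"))) {
            missing.add("data");
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no path is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingEntries(root);
        if (missing.isEmpty()) {
            System.out.println("Layout looks complete.");
        } else {
            System.out.println("Missing: " + String.join(", ", missing));
        }
    }
}
```

Run from the root of the extracted API archive, it lists whichever of the three entries is still missing. It does not inspect the contents of the data folder.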

Lookup

For testing or basic usage, you can simply use the command-line lookup interface:

./wiktionary	# interactive, or:
./wiktionary [wordToLookUp [anotherWord [...]]]

To integrate SemWiktionary within your own application, or to export the data in any format you wish, use the provided API. Its documentation is available in the doc/javadoc folder of the archive. You will need to include the SemWiktionary JAR and all the JARs in the lib folder on your classpath, and provide a data folder containing the database at the root of your project folder.
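
One way to assemble such a classpath is sketched below, assuming the archive layout described earlier (SemWiktionary.jar next to a lib folder of JARs). The `ClasspathBuilder` class and its `buildClasspath` method are hypothetical names for illustration and are not part of SemWiktionary; only standard `java.nio.file` calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not shipped with SemWiktionary.
final class ClasspathBuilder {

    /** Joins SemWiktionary.jar and every *.jar found in lib into one classpath string. */
    static String buildClasspath(Path root) throws IOException {
        List<String> entries = new ArrayList<>();
        // The glob keeps only JAR files; anything else in lib is ignored.
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(root.resolve("lib"), "*.jar")) {
            for (Path jar : jars) {
                entries.add(jar.toString());
            }
        }
        Collections.sort(entries); // directory order is unspecified, so sort for stable output
        entries.add(0, root.resolve("SemWiktionary.jar").toString());
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(buildClasspath(root));
    }
}
```

The printed string can be passed to the `-cp` option of `javac` and `java`, with your own application's classes appended to it.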

Tweaking the parser

If you are interested in modifying the parser, generating your own database, and so on, download the source and read doc/Parser/How to parse a dump file.md.

Equivalent projects and rationale

  • JWKTL. Undocumented; the authors did not allow access to the source code (yet?).

Several tools parse MediaWiki markup and create an AST from it. However, most of them are both overkill and not specific enough for the Wiktionary dialects (much more structured than Wikipedia, for which most tools are tailored).

License

GNU General Public License.

Contact the authors to request different licensing terms.

Credits

Authors

Tutors

  • Michel Gautero
  • Carine Fédèle

Used projects

  • Neo4j graph database (GPL)
  • JFlex Java lexer by Gerwin Klein (GPL)
  • Markdown doclet documentation parser by Richard Nichols (GPLv2)
  • JUnit unit-testing framework (CPL)
  • Unitils extensions for JUnit (Apache 2)
  • Gwtwiki wiki-text to plain-text converter (EPLv1)

Miscellaneous