# Getting started with the Java API
The API can be added to your project by either downloading the jar file or adding the following dependency to your pom.xml:

In [1]:
%%loadFromPOM
<dependency>
  <groupId>de.tuebingen.uni.sfs.germanet</groupId>
  <artifactId>germanet-api</artifactId>
  <version>13.2.1</version>
</dependency>

The API does some logging: for example, when the GermaNet data is loaded, each XML file is logged once it has been loaded. If your project does not include a logging library (e.g. logback-classic), you will get a warning message. If you want to see the logs, add a logging library such as logback-classic as a dependency.

Once you have added the GermaNet dependency to your project, you can import the API:

In [2]:
import de.tuebingen.uni.sfs.germanet.api.*;

Whenever you want to use the API, the first step is to create a GermaNet object, which loads the data and provides access to it. The data (all XML files) must be stored in a single directory, whose path is passed as the first argument to the GermaNet constructor. If you want to run this code, put your XML files in a "germanet_data" directory in your home directory, or change the path to the location on your computer.
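
Before constructing the GermaNet object, it can help to check that the data directory really contains XML files. The following is only a minimal sketch using the Java standard library; the class `DataDirCheck` and its method are hypothetical helpers and not part of the GermaNet API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the GermaNet API.
class DataDirCheck {

    // Returns the sorted names of all *.xml files directly inside dir,
    // or an empty list if dir does not exist or is not a directory.
    static List<String> listXmlFiles(Path dir) {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return names;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Same location as assumed in this tutorial: <home>/germanet_data
        Path dataPath = Paths.get(System.getProperty("user.home"), "germanet_data");
        List<String> xmlFiles = listXmlFiles(dataPath);
        if (xmlFiles.isEmpty()) {
            System.out.println("No XML files found in " + dataPath);
        } else {
            System.out.println(xmlFiles.size() + " XML files found in " + dataPath);
        }
    }
}
```
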
The API also provides methods to compute semantic similarity / relatedness between words (Synsets). To be able to use all of them you have to provide frequency lists for each word category. These lists can be downloaded from:



In [3]:
String userHome = System.getProperty("user.home");
String data_path = userHome+"/germanet_data";
String nounFreqListPath = data_path + "/noun_freqs_decow14_16.txt";
String verbFreqListPath = data_path + "/verb_freqs_decow14_16.txt";
String adjFreqListPath = data_path + "/adj_freqs_decow14_16.txt";
System.out.println(data_path);
//GermaNet germanet = new GermaNet(data_path, false);
GermaNet germanet = new GermaNet(data_path, nounFreqListPath, verbFreqListPath, adjFreqListPath);
System.out.println("Germanet loaded.");

/Users/nwitte/PycharmProjects/germanetpy/data


SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


Germanet loaded.


The data has been loaded and we can now use the API to extract specific information from the data.

# How to inspect information for a single word input
## Inspect synsets
Let's consider the input word *Fußball* 'football'. The following code snippet shows how to extract all synsets for a given input word form. Many words are ambiguous; *Fußball*, for example, belongs to two synsets (one represents the ball and one the activity). The string representations include information such as the forms of the corresponding lexical units or sense descriptions (paraphrases), which can be helpful when you want to select a specific meaning for a given word. In this case, let's say we are interested in the second meaning of *Fußball*, namely the game and not the ball.

In [4]:
List<Synset> fussball_synsets = germanet.getSynsets("Fußball");
// the length of the retrieved list equals the number of possible senses of the word, in this case 2
System.out.println(String.format("The input has %d senses\n", fussball_synsets.size()));
for (Synset synset : fussball_synsets){
    System.out.println(synset.toString());
}
Synset fussball_synset = germanet.getSynsetByID(21624);

The input has 2 senses

id: 7944, orth forms: [Fußball], paraphrases: Sport, Freizeit: ein Ball zur Ausübung des Sportes
id: 21624, orth forms: [Fußball, Fußballspiel, Fußballsport], paraphrases: Sport, Freizeit, kein Plural: eine beliebte Mannschaftssportart, welche mit 22 Spielern und einem Ball gespielt wird


Every synset has a number of properties that can be extracted. Each synset has a unique id. A synset belongs to one of three word categories (WordCategory.verben, WordCategory.nomen, WordCategory.adj).
For each word category, the semantic space is divided into a number of semantic fields (e.g. *Besitz*, *Kommunikation*, *Geschehen*), called WordClass. The word classes that belong to a given word category can be looked up.

In [5]:
int id = fussball_synset.getId();
WordCategory word_category = fussball_synset.getWordCategory();
WordClass semantic_field = fussball_synset.getWordClass();
System.out.println(String.format("The synset id is %d; the synset belongs to the word category %s \n and to the semantic field %s.", id, word_category, semantic_field));
// returns the possible WordClasses of verbs
WordCategory.verben.getWordClasses();

The synset id is 21624; the synset belongs to the word category nomen 
 and to the semantic field Geschehen.


[Allgemein, Gefuehl, Gesellschaft, Koerper, natPhaenomen, Perzeption, Besitz, Kognition, Kommunikation, Koerperfunktion, Konkurrenz, Kontakt, Lokation, Schoepfung, Veraenderung, Verbrauch]

Synsets are related to other synsets via conceptual relations (ConRel). The most important relation is the hypernymy / hyponymy relation. Direct hypernyms of a synset (one level above) and direct hyponyms (one level below) can be accessed through the relations has_hypernym and has_hyponym.

In [6]:
List<Synset> direct_hypernyms = fussball_synset.getRelatedSynsets(ConRel.has_hypernym);
List<Synset> direct_hyponyms = fussball_synset.getRelatedSynsets(ConRel.has_hyponym);
Synset hypernym = direct_hypernyms.get(0);
System.out.println("One example hypernym of Fußball is : ");
System.out.println(hypernym.toString());
Synset hyponym = direct_hyponyms.get(0);
System.out.println("One example hyponym of Fußball is : ");
System.out.println(hyponym.toString());


One example hypernym of Fußball is : 
id: 21606, orth forms: [Ballspiel, Ballsportart, Ballsport], paraphrases: Ein Spiel, das mit einem Ball gespielt wird; Eine Sportart, die mit einem Ball betrieben wird
One example hyponym of Fußball is : 
id: 79925, orth forms: [Hallenfußball]


The synset *Fußball* has exactly one hypernym and several hyponyms. It is also possible to list all hypernyms on the paths from *Fußball* up to the top node (root node). The level at which the *Fußball* synset is attached to the graph is called its depth and can also be accessed.


In [7]:
Set<List<String>> all_hypernyms = new HashSet<>();
List<List<Synset>> transitive_closure = fussball_synset.getTransRelatedSynsets(ConRel.has_hypernym);
for (List<Synset> p : transitive_closure){
    for (Synset hypernym : p){
        // skip the start synset itself; collect the orth forms of every hypernym on the path
        if (!hypernym.equals(fussball_synset)) {
            all_hypernyms.add(hypernym.getAllOrthForms());
        }
    }
}

System.out.println(all_hypernyms);
System.out.println(String.format("The synset has a depth of %d and has %d distinct hypernym synsets", fussball_synset.getDepth(), all_hypernyms.size()));

[[Veranstaltung], [qualitative Beziehung], [Sport, Sportdisziplin, Sportart, Disziplin], [Ereignis], [Ballspiel, Ballsportart, Ballsport], [Verhältnis, Relation, Beziehung], [Situation], [Wettkampf, Kampf], [GNROOT], [Sportwettkampf], [Tat, Tätigkeit, Handlung, Aktivität], [Zustand], [Spiel, Sportspiel, Partie, Match], [Geschehen, Geschehnis], [Sportereignis, Sportveranstaltung], [Konflikt, Auseinandersetzung]]
The synset has a depth of 8 and has 16 distinct hypernym synsets
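
To make the idea of the transitive hypernym closure and of depth concrete without the GermaNet data, here is a small self-contained sketch. The hierarchy is invented toy data, the class `ToyHypernyms` is hypothetical, and defining depth as the number of edges on the shortest path to the root is an assumption made for this sketch, not a statement about how the API computes it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy hierarchy (invented data, not GermaNet content).
class ToyHypernyms {
    // node -> direct hypernyms; a node may have more than one
    static final Map<String, List<String>> HYPERNYMS = new HashMap<>();
    static {
        HYPERNYMS.put("football", List.of("ballgame"));
        HYPERNYMS.put("ballgame", List.of("sport", "game"));
        HYPERNYMS.put("sport", List.of("activity"));
        HYPERNYMS.put("game", List.of("activity"));
        HYPERNYMS.put("activity", List.of("root"));
        HYPERNYMS.put("root", List.of());
    }

    // All paths from node up to the root; every path starts with node itself.
    static List<List<String>> pathsToRoot(String node) {
        List<List<String>> paths = new ArrayList<>();
        List<String> parents = HYPERNYMS.getOrDefault(node, List.of());
        if (parents.isEmpty()) {
            paths.add(new ArrayList<>(List.of(node)));
            return paths;
        }
        for (String parent : parents) {
            for (List<String> tail : pathsToRoot(parent)) {
                List<String> path = new ArrayList<>();
                path.add(node);
                path.addAll(tail);
                paths.add(path);
            }
        }
        return paths;
    }

    // Distinct hypernyms on any path to the root, excluding the node itself.
    static Set<String> allHypernyms(String node) {
        Set<String> result = new TreeSet<>();
        for (List<String> path : pathsToRoot(node)) {
            result.addAll(path);
        }
        result.remove(node);
        return result;
    }

    // Assumption for this sketch: depth = edges on the shortest path to the root.
    static int depth(String node) {
        int best = Integer.MAX_VALUE;
        for (List<String> path : pathsToRoot(node)) {
            best = Math.min(best, path.size() - 1);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pathsToRoot("football"));
        System.out.println(allHypernyms("football"));
        System.out.println(depth("football"));
    }
}
```

Because "ballgame" has two hypernyms in the toy data, there are two paths to the root, which is why the distinct hypernyms are collected in a set.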


## Use the semantic utils to measure semantic similarity / relatedness
You can also use the API to compare a synset with another synset. These methods work only for two synsets that have the 
same word category, for example for two nouns.
There are two different types of similarity measures.
- path-based measures
- information-content-based measures

Path-based measures compute the semantic relatedness between two concepts based on the shortest path between two synsets. The shortest path is the minimal number of nodes you have to walk from the source synset to the target synset.
Different measures weigh or normalize the path-length in different ways.
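
As a rough illustration of a path-based measure, the sketch below finds the shortest path between two nodes of an invented toy hierarchy with a breadth-first search and turns the length into a score. The class `ToyPathSimilarity` is hypothetical, path length is counted in edges here, and the normalisation `1 / (1 + length)` is just one simple choice; it is not claimed to be the formula used by the API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy graph (invented data, not GermaNet content).
class ToyPathSimilarity {
    // Undirected adjacency: hypernym edges can be walked in both directions.
    static final Map<String, Set<String>> NEIGHBOURS = new HashMap<>();

    static void addEdge(String child, String parent) {
        NEIGHBOURS.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
        NEIGHBOURS.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
    }

    static {
        addEdge("instrument", "root");
        addEdge("wind", "instrument");
        addEdge("keyboard", "instrument");
        addEdge("brass", "wind");
        addEdge("woodwind", "wind");
        addEdge("trumpet", "brass");
        addEdge("trombone", "brass");
        addEdge("flute", "woodwind");
        addEdge("piano", "keyboard");
    }

    // Breadth-first search; returns the number of edges on the shortest path,
    // or -1 if the two nodes are not connected.
    static int shortestPath(String source, String target) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                return dist.get(current);
            }
            for (String next : NEIGHBOURS.getOrDefault(current, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    // Illustrative normalisation only: shorter paths give higher scores.
    static double similarity(String a, String b) {
        int length = shortestPath(a, b);
        return length < 0 ? 0.0 : 1.0 / (1 + length);
    }

    public static void main(String[] args) {
        System.out.println(shortestPath("trumpet", "trombone"));
        System.out.println(shortestPath("trumpet", "flute"));
        System.out.println(similarity("trumpet", "trombone") > similarity("trumpet", "flute"));
    }
}
```
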

To use the measures you first have to create a SemanticUtils object:


In [9]:
SemanticUtils semanticUtils = germanet.getSemanticUtils();


The following example shows how to use the path-based relatedness measures to check whether *Trompete* 'trumpet' is more related to
*Posaune* 'trombone' than to *Flöte* 'flute' and how to disambiguate *Flügel* 'wing', 'blade', 'grand piano' in the context of *Klavier* 'piano'.

In [10]:
Synset trompete = germanet.getSynsets("Trompete").get(0);
System.out.println(trompete.toString());
Synset flöte = germanet.getSynsets("Flöte").get(0);
System.out.println(flöte.toString());
Synset posaune = germanet.getSynsets("Posaune").get(0);
System.out.println(posaune.toString());
Synset klavier = germanet.getSynsets("Klavier").get(0);
System.out.println(klavier.toString());
List<Synset> flügel_synsets = germanet.getSynsets("Flügel");
System.out.println("\nThe word Flügel has the following synsets:");
for (Synset s : flügel_synsets) {
    System.out.println(s);
}

Double trompete_posaune = semanticUtils.getSimilaritySimplePath(trompete, posaune, 10);
Double trompete_flöte = semanticUtils.getSimilaritySimplePath(trompete, flöte, 10);

System.out.println(String.format("\nBased on the simple path measure, is Trompete more similar to Posaune, than to Flöte? %b" , (trompete_posaune > trompete_flöte)));

Double highest_sim_simple = 0.0;
Double highest_sim_leacock = 0.0;
Double highest_sim_wu = 0.0;
Synset most_similar_synset = null;
for (Synset synset : flügel_synsets){
    if (synset.getWordCategory().equals(WordCategory.nomen)){
        Double sim_simple = semanticUtils.getSimilarity(SemRelMeasure.SimplePath, synset, klavier, 1);
        Double sim_leacock = semanticUtils.getSimilarity(SemRelMeasure.LeacockAndChodorow, synset, klavier, 1);
        Double sim_wu = semanticUtils.getSimilarity(SemRelMeasure.WuAndPalmer, synset, klavier, 1);
        System.out.println(String.format("\n These are the similarities between the synset for Klavier and %s : \n Simple Path : %.2f\n Leacock and Chodorow: %.2f\n Wu and Palmer : %.2f", synset.toString(), sim_simple, sim_leacock, sim_wu));
        // keep the candidate only if it beats the current best on all three measures
        if (sim_simple > highest_sim_simple && sim_leacock > highest_sim_leacock && sim_wu > highest_sim_wu){
            highest_sim_simple = sim_simple;
            highest_sim_leacock = sim_leacock;
            highest_sim_wu = sim_wu;
            most_similar_synset = synset;
        }
    }
}
System.out.println(String.format("\nThe most similar synset out of all synsets corresponding to the word 'Flügel' is : %s" , most_similar_synset.toString()));

id: 11590, orth forms: [Trompete], paraphrases: Musik: Hohes Blechblasinstrument mit einem Kesselmundstück
id: 11572, orth forms: [Flöte], paraphrases: ein Blasinstrument, ein Musikinstrument
id: 11589, orth forms: [Posaune], paraphrases: Musik: ein Blechblasinstrument mit einem Zug zum verändern des Tones
id: 11619, orth forms: [Piano, Klavier, Pianoforte], paraphrases: Tasteninstrument, dessen Klang durch Saiten erzeugt wird; Klavier, Abkürzung für Pianoforte. (Am Klavier kann man im Gegensatz zum Cembalo die Dynamik verändern.)

The word Flügel has the following synsets:
id: 9697, orth forms: [Flügel, Seitenflügel], paraphrases: Architektur, Bauwesen: Gebäudeteil; Teilgebäude eines größeren Gebäudekomplexes (wie zum Beispiel eines Schlosses)
id: 12102, orth forms: [Flügel, Rotorblatt], paraphrases: Technik: eine flügelartige Verlängerung an einem Bauteil; Flugwesen, Raumfahrt, Flugzeugbau: ein Teil eines Fluggerätes, eines Raumfahrzeuges
id: 73683, orth forms: [Flügel], paraphrases:
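
The loop in the cell above only updates the best candidate when all three measures improve at once, so `most_similar_synset` can remain `null` if no candidate beats the initial values on every measure. A simpler variant is to pick the candidate with the highest score under a single measure. The sketch below shows that selection step in isolation; the class `SenseSelector`, the sense labels and the scores are hypothetical and unrelated to real GermaNet values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of the GermaNet API.
class SenseSelector {

    // Returns the candidate with the highest score (the first one on ties),
    // or an empty Optional if there are no candidates.
    static <T> Optional<T> mostSimilar(List<T> candidates, ToDoubleFunction<T> score) {
        T best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (T candidate : candidates) {
            double s = score.applyAsDouble(candidate);
            if (best == null || s > bestScore) {
                best = candidate;
                bestScore = s;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Invented scores for three senses of an ambiguous word.
        Map<String, Double> toyScores = new LinkedHashMap<>();
        toyScores.put("sense A", 0.10);
        toyScores.put("sense B", 0.25);
        toyScores.put("sense C", 0.80);
        List<String> senses = new ArrayList<>(toyScores.keySet());
        Optional<String> best = mostSimilar(senses, toyScores::get);
        System.out.println(best.orElse("no candidate"));
    }
}
```

Returning an `Optional` makes the "no candidate" case explicit instead of risking a `NullPointerException` when the result is printed.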

Information-content-based measures augment the structural distances captured by the taxonomy with word frequencies. The lack of uniform distances in the graph can thus be mitigated by additional information about the typicality of words. The frequencies are used to compute the information content, which grades concepts from specific to general.
If a very specific synset is compared to a very general one, the relatedness will be low. The relatedness is measured based on the information content of the lowest common subsumer (the lowest synset in the hierarchy that is a hypernym of both synsets being compared).
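
The sketch below illustrates the idea on an invented toy tree. It uses the textbook definitions: the probability of a concept is its cumulative frequency (its own count plus the counts of everything below it) divided by the total, the information content is the negative logarithm of that probability, and the Resnik score of two concepts is the information content of their lowest common subsumer. The class `ToyResnik`, the hierarchy and the frequencies are hypothetical; the API may differ in details such as normalisation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy tree with invented frequencies (not GermaNet content).
class ToyResnik {
    static final Map<String, String> PARENT = new HashMap<>();
    static final Map<String, Integer> OWN_FREQ = new HashMap<>();

    static void add(String node, String parent, int freq) {
        if (parent != null) {
            PARENT.put(node, parent);
        }
        OWN_FREQ.put(node, freq);
    }

    static {
        add("root", null, 0);
        add("thing", "root", 100);
        add("organism", "root", 20);
        add("plant", "organism", 30);
        add("animal", "organism", 40);
        add("oak", "plant", 6);
        add("red_oak", "oak", 2);
        add("holm_oak", "oak", 2);
    }

    // The node itself followed by its ancestors up to the root.
    static List<String> chainToRoot(String node) {
        List<String> chain = new ArrayList<>();
        for (String n = node; n != null; n = PARENT.get(n)) {
            chain.add(n);
        }
        return chain;
    }

    // Own frequency plus the frequencies of all nodes below this node.
    static int cumulativeFreq(String node) {
        int total = 0;
        for (String n : OWN_FREQ.keySet()) {
            if (chainToRoot(n).contains(node)) {
                total += OWN_FREQ.get(n);
            }
        }
        return total;
    }

    // IC(c) = -ln(p(c)); general concepts have low IC, specific ones high IC.
    static double informationContent(String node) {
        double p = (double) cumulativeFreq(node) / cumulativeFreq("root");
        return -Math.log(p);
    }

    // First node on a's chain that also lies on b's chain.
    static String lowestCommonSubsumer(String a, String b) {
        List<String> chainB = chainToRoot(b);
        for (String n : chainToRoot(a)) {
            if (chainB.contains(n)) {
                return n;
            }
        }
        return null; // not reached when both nodes are in the same rooted tree
    }

    static double resnik(String a, String b) {
        return informationContent(lowestCommonSubsumer(a, b));
    }

    public static void main(String[] args) {
        System.out.println(lowestCommonSubsumer("plant", "animal"));
        System.out.println(resnik("plant", "animal"));
        System.out.println(resnik("red_oak", "holm_oak"));
    }
}
```

In the toy data both pairs are two edges apart, yet the pair of oaks receives the higher Resnik score because their common subsumer is much more specific.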

The following code snippet shows the advantage of the IC-based measures. While path-based measures would classify the word pair *Pflanze* 'plant', *Tier* 'animal' as being almost as similar as the word pair *Roteiche* 'red oak' and *Steineiche* 'holm oak', the IC-based measures distinguish whether two synsets are very general or more specific and consequently assign a higher similarity score to the second word pair.


In [11]:
Synset pflanze = germanet.getSynsetByID(44960);
Synset tier = germanet.getSynsetByID(48805);
Double sim_leacock_pflanze_tier = semanticUtils.getSimilarity(SemRelMeasure.LeacockAndChodorow, pflanze, tier, 1);
Synset roteiche = germanet.getSynsetByID(46054);
Synset steineiche = germanet.getSynsetByID(46056);
Double sim_leacock_roteiche_steineiche = semanticUtils.getSimilarity(SemRelMeasure.LeacockAndChodorow, roteiche, steineiche, 1);
System.out.println(String.format("path-based similarity between Pflanze and Tier: %.2f, between Roteiche and Steineiche %.2f", sim_leacock_pflanze_tier, sim_leacock_roteiche_steineiche));

Double sim_resnik_pflanze_tier = semanticUtils.getSimilarity(SemRelMeasure.Resnik, pflanze, tier, 1);
Double sim_resnik_roteiche_steineiche = semanticUtils.getSimilarity(SemRelMeasure.Resnik, roteiche, steineiche, 1);
System.out.println(String.format("ic-based similarity between Pflanze and Tier: %.2f, between Roteiche and Steineiche %.2f", sim_resnik_pflanze_tier, sim_resnik_roteiche_steineiche));

path-based similarity between Pflanze and Tier: 0,61, between Roteiche and Steineiche 0,69
ic-based similarity between Pflanze and Tier: 0,11, between Roteiche and Steineiche 0,52


For a more convenient search through the ontology and the semantic relatedness computation, you can use the GermaNet web application "Rover":
https://weblicht.sfs.uni-tuebingen.de/rover/

## Inspect lexical units
Every synset contains one or several lexical units. The list of lexical units (lexunits) can be accessed for any synset. Let's inspect the lexical units for *Fußball* 'football':
We have the lexunit *Fußballspiel* 'football match', the lexunit *Fußball* 'football' and the lexunit *Fußballsport* 'football (the sport)'.

In [142]:
List<LexUnit> lexical_units_fussball = fussball_synset.getLexUnits();
System.out.println(lexical_units_fussball);


[id: 29778, orth form: Fußballsport, synset id: 21624, sense: 1, source: core, named entity: false, artificial: false, style marking: false, id: 29776, orth form: Fußballspiel, synset id: 21624, sense: 2, source: core, named entity: false, artificial: false, style marking: false, id: 29777, orth form: Fußball, synset id: 21624, sense: 2, source: core, named entity: false, artificial: false, style marking: false]


Every lexical unit has a number of orthographic forms. There are four types of orthographic form, but not every lexical unit has an entry for all of them:
* main orth. form
* orth. variation
* old orth. form
* old orth. variation

We can see that the lexunit for *Fußball* has only one orth form, whereas the related lexunit *Fußballklub* 'football club' has the orthographic variation *Fußballclub*.


In [143]:
LexUnit fussball_unit = germanet.getLexUnitByID(29777);
List<String> orth_forms = fussball_unit.getOrthForms();
System.out.println(orth_forms);
LexUnit fussballclub_unit = germanet.getLexUnitByID(32423);
orth_forms = fussballclub_unit.getOrthForms();
System.out.println(orth_forms);

[Fußball]
[Fußballklub, Fußballclub]


*Fußball* is a compound; compounds are very frequent in German. GermaNet stores information about the compound, for example that *Fuß* 'foot' is the modifier and *Ball* 'ball' is the head.

In [144]:
System.out.println(fussball_unit.getCompoundInfo().toString());


Fuß (Nomen) + Ball


Lexical units are related to other lexical units by different lexical relations. The most common and most general 
lexical relation is synonymy, but there are other relations annotated as well.

For example, for some compounds the relation between the compound and its modifier has been annotated. In this example, the compound *Fußball* is linked to *Fuß* by the relation has_manner_of_functioning.

Relations can be unidirectional (e.g. the relation has_manner_of_functioning goes from *Fußball* to *Fuß*, but not the other way around) or bidirectional (e.g. *Fußball* and *Fußballspiel* are synonyms of each other). If you are interested in finding out which unidirectional relations point towards *Fußball*, these can be accessed by passing RelDirection.incoming:

In [145]:
List <LexUnit> related = fussball_unit.getRelatedLexUnits(LexRel.has_manner_of_functioning, RelDirection.outgoing);
System.out.println(String.format("A related lexical unit related to Fußball via an outgoing relation (has_manner_of_functioning) is, for example :\n%s", related.get(0).toString()));
List <LexUnit> incoming_relations = fussball_unit.getRelatedLexUnits(RelDirection.incoming);
System.out.println(String.format("\n A related lexical unit related to Fußball via an incoming relation (pertainymy) is, for example : \n%s", incoming_relations.get(0).toString()));

A related lexical unit related to Fußball via an outgoing relation (has_manner_of_functioning) is, for example :
id: 35740, orth form: Fuß, synset id: 26149, sense: 3, source: core, named entity: false, artificial: false, style marking: false

 A related lexical unit related to Fußball via an incoming relation (pertainymy) is, for example : 
id: 4226, orth form: fußballerisch, synset id: 2869, sense: 1, source: core, named entity: false, artificial: false, style marking: false
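
The difference between outgoing and incoming relations can be pictured as a list of directed edges that is read from either end. The following sketch stores the three relations mentioned in this section as plain strings; the class `ToyLexRelations` and the relation labels are hypothetical illustrations and do not use the API's LexRel or RelDirection types. ASCII spellings are used to keep the sketch independent of the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Toy relation store, not part of the GermaNet API.
class ToyLexRelations {
    // Each entry is {source, relation label, target}: one directed edge.
    static final List<String[]> EDGES = new ArrayList<>();

    static void relate(String source, String relation, String target) {
        EDGES.add(new String[] {source, relation, target});
    }

    static {
        // unidirectional relations: stored once
        relate("Fussball", "has_manner_of_functioning", "Fuss");
        relate("fussballerisch", "pertainymy", "Fussball");
        // bidirectional relation: stored once per direction
        relate("Fussball", "synonymy", "Fussballspiel");
        relate("Fussballspiel", "synonymy", "Fussball");
    }

    // Targets of all edges that start at the given unit.
    static List<String> outgoing(String unit) {
        List<String> result = new ArrayList<>();
        for (String[] edge : EDGES) {
            if (edge[0].equals(unit)) {
                result.add(edge[2]);
            }
        }
        return result;
    }

    // Sources of all edges that end at the given unit.
    static List<String> incoming(String unit) {
        List<String> result = new ArrayList<>();
        for (String[] edge : EDGES) {
            if (edge[2].equals(unit)) {
                result.add(edge[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("outgoing from Fussball: " + outgoing("Fussball"));
        System.out.println("incoming to Fussball:   " + incoming("Fussball"));
    }
}
```
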


Some lexical units have sense definitions harvested from the German Wiktionary. These can be accessed with getWiktionaryParaphrases().


In [146]:
fussball_unit.getWiktionaryParaphrases();

[LexUnit ID: 29777, Wiktionary ID: 24602, Wiktionary sense definition: Sport, Freizeit, kein Plural: eine beliebte Mannschaftssportart, welche mit 22 Spielern und einem Ball gespielt wird, edited: false]

Some lexical units have also been linked to the English WordNet. These links can be accessed with getIliRecords().


In [147]:
fussball_unit.getIliRecords();

[LexUnit ID: 29777, EWN relation: synonym, PWN word: association football, PWN 2.0 ID: ENG20-00453585-n, PWN 3.0 ID: ENG30-00478262-n, source: initial
English synonyms from PWN 2.0: soccer
English paraphrase from PWN 2.0: a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal]

Lexical units that are verbs provide information on language use by giving at least one example sentence.
They are also annotated with subcategorisation patterns / verb complementations (frames). For the given example, the frame specifies that the verb can take an accusative complement (AN).


In [148]:
LexUnit schiessen = germanet.getLexUnitByID(80272);
System.out.println(schiessen.toString());
System.out.println(schiessen.getExamples());
System.out.println(schiessen.getFrames());

id: 80272, orth form: schießen, synset id: 56962, sense: 4, source: core, named entity: false, artificial: false, style marking: false
[Er hatte den Ball ins Tor geschossen., frame: NN.AN.BD]
[NN.AN.BD]


## How to extract a large number of examples by applying a filter function
If you would like to extract several lexical units or synsets from GermaNet that fulfil certain conditions, you can create a filter configuration. A filter configuration allows you, for example, to search for words of specific word classes (e.g. you might be interested in extracting all abstract nouns) or to extract all words that contain a specific subword. To run a search, create a FilterConfig object and pass a search string as an argument. All other options have defaults, but you can set them to adapt your search.
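
Conceptually, such a filter is a predicate over orthographic forms. The toy class below mimics the setter style used in this tutorial but is entirely hypothetical and does not touch the GermaNet API. It assumes that a regular expression has to match the whole word, which fits search strings such as `.*kuchen`; whether the API behaves exactly like this is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a filter configuration, not the API's FilterConfig.
class ToyFilter {
    private final String search;
    private boolean ignoreCase = false;
    private boolean regEx = false;

    ToyFilter(String search) {
        this.search = search;
    }

    ToyFilter setIgnoreCase(boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        return this;
    }

    ToyFilter setRegEx(boolean regEx) {
        this.regEx = regEx;
        return this;
    }

    // Assumption: regular expressions must match the complete word.
    boolean matches(String orthForm) {
        if (regEx) {
            int flags = ignoreCase ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
            return Pattern.compile(search, flags).matcher(orthForm).matches();
        }
        return ignoreCase ? search.equalsIgnoreCase(orthForm) : search.equals(orthForm);
    }

    // Keeps the words that match, in their original order.
    List<String> apply(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (matches(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Apfelkuchen", "Kuchenform", "hoffen", "laufen", "essen", "Fass");
        System.out.println(new ToyFilter(".*kuchen").setRegEx(true).apply(words));
        // "Fass" is not matched: the pattern requires at least one character after the double letter
        System.out.println(new ToyFilter(".+(f{2,}|s{2,}).+").setRegEx(true).apply(words));
    }
}
```
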


In [149]:
// we search for "schießen"; we do not want to care about upper or lower case or about the different orth forms.
// by default, words of all word categories and word classes are included
FilterConfig filterconfig = new FilterConfig("schießen");
filterconfig.setIgnoreCase(true);
List<Synset> result = germanet.getSynsets(filterconfig);
System.out.println("filtered result");
for (Synset s : result){
    System.out.println(s.toString());
}
// Let's say we are only interested in synsets of a specific semantic class:
filterconfig.setWordClasses(WordClass.Konkurrenz);
result = germanet.getSynsets(filterconfig);
System.out.println("\nfiltered result, the following synsets belong to WordClass 'Konkurrenz':");
for (Synset s : result){
    System.out.println(s.toString());
}
// if we now also filter by word category and keep only nouns, the result will be empty, because no noun entry for 'schießen' belongs to the word class 'Konkurrenz':
filterconfig.setWordCategories(WordCategory.nomen);
result = germanet.getSynsets(filterconfig);
System.out.println(String.format("\nSize of result is %d" , result.size()));


filtered result
id: 57998, orth forms: [schießen, stürmen, stürzen], paraphrases: mit hoher Geschwindigkeit bewegen
id: 21555, orth forms: [Schießsport, Sportschießen, Schießen]
id: 123485, orth forms: [schießen]
id: 56650, orth forms: [schießen], paraphrases: einen Schuss/Schüsse abgeben; einen Schuss abgeben
id: 59153, orth forms: [schießen, knipsen], paraphrases: dilettantisch fotografieren, Bilder machen
id: 60205, orth forms: [schießen, erlegen], paraphrases: Tiere auf der Jagd erlegen; Jägersprache: ein relativ großes Tier bei der Jagd töten, meist durch einen Schuss (Rehe, Hirschen, etc.); (in Bezug auf Wild) erlegen
id: 56664, orth forms: [schießen], paraphrases: (an einer bestimmten Stelle) mit einem Schuss treffen; mit einem Schuss treffen
id: 56962, orth forms: [schießen], paraphrases: den Ball mit dem Fuß an eine bestimmte Stelle befördern; einen Ball (mit dem Fuss) fortbewegen

filtered result, the following synsets belong to WordClass 'Konkurrenz':
id: 56650, orth forms: 

Besides using full words as search strings, we can use regular expressions. This can be very useful if you are interested
in words with certain character sequences. The next examples show how to extract all words that end with *kuchen* 'cake', all
words that contain a whitespace or hyphen (useful, for example, to extract multiword expressions), and all verbs that contain
'ff' or 'ss':


In [150]:
FilterConfig filterconfig = new FilterConfig(".*kuchen");
filterconfig.setRegEx(true);
filterconfig.setWordCategories(WordCategory.nomen);
List<LexUnit> result = germanet.getLexUnits(filterconfig);
System.out.println(String.format("Found  %d words that end with 'kuchen' in GermaNet \n An example of such is: %s \n Another example is : %s"
      , result.size(), result.get(0).toString(), result.get(10).toString()));

// extract all words that contain a whitespace or a hyphen
filterconfig = new FilterConfig(".+(\\s|-).+");
filterconfig.setRegEx(true);
filterconfig.setWordCategories(WordCategory.nomen);
result = germanet.getLexUnits(filterconfig);
System.out.println(String.format("\nFound  %d multiword expressions with whitespace or hypen in GermaNet \n An example of such is: %s \n Another example is : %s"
      , result.size(), result.get(0).toString(), result.get(10).toString()));

// extract all verbs that contain at least two consecutive 's' or 'f' (with characters before and after)
filterconfig = new FilterConfig(".+(f{2,}|s{2,}).+");
filterconfig.setRegEx(true);
filterconfig.setWordCategories(WordCategory.verben);
result = germanet.getLexUnits(filterconfig);
System.out.println(String.format("\nFound  %d verbs with double s or double f in GermaNet \n An example of such is: %s \n Another example is : %s"
      , result.size(), result.get(1).toString(), result.get(10).toString()));

Found  54 words that end with 'kuchen' in GermaNet 
 An example of such is: id: 111562, orth form: Buchweizenpfannkuchen, synset id: 82694, sense: 1, source: core, named entity: false, artificial: false, style marking: false 
 Another example is : id: 118233, orth form: Gewürzkuchen, synset id: 87859, sense: 1, source: core, named entity: false, artificial: false, style marking: false

Found  5419 multiword expressions with whitespace or hypen in GermaNet 
 An example of such is: id: 15157, orth form: Offenbarung des Johannes, synset id: 10909, sense: 1, source: core, named entity: false, artificial: false, style marking: false 
 Another example is : id: 66741, orth form: Motten-Königskerze, synset id: 46513, sense: 1, source: core, named entity: false, artificial: false, style marking: false

Found  974 verbs with double s or double f in GermaNet 
 An example of such is: id: 159384, orth form: hineinpressen, synset id: 120024, sense: 1, source: core, named entity: false, artificial: f