# PyGermanet Tutorial

The following tutorial shows some examples of how to use the Python API for GermaNet. GermaNet is a lexical-semantic 
net that relates German nouns, verbs and adjectives semantically by grouping lexical units that express
the same concept into synsets. 

With the Python API we can extract synsets and lexical units for a given word and inspect different properties and related
synsets / lexunits. 


In [1]:
from germanet import Germanet
import frames
from filterconfig import Filterconfig
from synset import WordCategory, WordClass
germanet = Germanet("data")



## How to inspect information for a single word input

### Inspect synsets

Let's consider the input word *Fußball* 'football'. The following shows how to extract all synsets for a given input word form.
Many words are ambiguous, and *Fußball* belongs to two synsets. The string representation of a synset includes its lexical units, which helps when you want to select
a specific meaning of a word. In this case, let's say we are interested in the sense of *Fußball* that denotes
the game, not the ball, so we retrieve that synset by its id.


In [47]:
fussball_synsets = germanet.get_synsets_by_orthform("Fußball")
# the length of the retrieved list equals the number of possible senses of the word, in this case 2
print("The input has %d senses " % len(fussball_synsets))
for synset in fussball_synsets:
    print(synset)
fussball_synset = germanet.get_synset_by_id('s21624')


The input has 2 senses 
Synset(id=s21624, lexunits=Fußballspiel, Fußball, Fußballsport)
Synset(id=s7944, lexunits=Fußball)


Every synset has a number of properties that can be extracted. Each synset has a unique id, which is the character
's' followed by a number. A synset belongs to one of three word categories (verb, noun, adjective). 
Within each word category, the semantic space is divided into a number of semantic fields (e.g. *Besitz*,
*Kommunikation*, *Geschehen*). 


In [48]:
# use synset_id rather than id, so that Python's built-in id() is not shadowed
synset_id = fussball_synset.id
word_category = fussball_synset.word_category
semantic_field = fussball_synset.word_class
print("The synset id is %s; the synset belongs to the word category %s \n and to the semantic field %s." %
      (synset_id, word_category, semantic_field))


The synset id is s21624; the synset belongs to the word category WordCategory.nomen 
 and to the semantic field WordClass.Geschehen.


Synsets are related to other synsets via conceptual relations. The most important relation is the hypernymy
/ hyponymy relation. Direct hypernyms of a synset (one level above) and direct hyponyms (one level below) can each be accessed through a separate field. All conceptually related 
synsets, grouped by relation, are stored in the relations field.


In [49]:
print(fussball_synset.direct_hypernyms)
print(fussball_synset.direct_hyponyms)
for relation, synsets in fussball_synset.relations.items():
    print("\nrelation : %s" % relation)
    print(synsets)
    

{Synset(id=s21606, lexunits=Ballsportart, Ballspiel, Ballsport)}
{Synset(id=s137475, lexunits=Klubfußball), Synset(id=s139802, lexunits=Torwandschießen), Synset(id=s21625, lexunits=Profifußball), Synset(id=s21626, lexunits=Frauenfußball, Damenfußball), Synset(id=s133731, lexunits=Weltfußball), Synset(id=s62081, lexunits=Vereinsfußball), Synset(id=s79925, lexunits=Hallenfußball), Synset(id=s124051, lexunits=Freizeitfußball), Synset(id=s69685, lexunits=Amateurfußball), Synset(id=s133871, lexunits=Mädchenfußball), Synset(id=s137940, lexunits=Berufsfußball), Synset(id=s84590, lexunits=Männerfußball), Synset(id=s104374, lexunits=Spitzenfußball), Synset(id=s71210, lexunits=Jugendfußball)}

relation : ConRel.has_hypernym
{Synset(id=s21606, lexunits=Ballsportart, Ballspiel, Ballsport)}

relation : ConRel.has_hyponym
{Synset(id=s137475, lexunits=Klubfußball), Synset(id=s139802, lexunits=Torwandschießen), Synset(id=s21625, lexunits=Profifußball), Synset(id=s21626, lexunits=Frauenfußball, Damenfu

We can see that *Fußball* has exactly one hypernym and several hyponyms. It is also possible to list all hypernyms
on the way from *Fußball* to the top node (root node). The level at which the *Fußball* synset is attached to the graph is called its depth and
can also be accessed. We can also check whether *Fußball* is the root or a leaf of the GermaNet graph (although
we already know that it is neither, as it has both hypernyms and hyponyms).


In [50]:
print(fussball_synset.all_hypernyms())
print("The synset has a depth of %d \n is it the root node? %s  \n is it a leaf node? %s"
      % (fussball_synset.min_depth(), str(fussball_synset.is_root()), str(fussball_synset.is_leaf()) ))

{Synset(id=s18413, lexunits=Handlung, Tat, Aktivität, Tätigkeit), Synset(id=s20870, lexunits=Auseinandersetzung, Konflikt), Synset(id=s73180, lexunits=Kampf, Wettkampf), Synset(id=s16557, lexunits=Geschehnis, Geschehen), Synset(id=s17614, lexunits=Veranstaltung), Synset(id=s18275, lexunits=Sportwettkampf), Synset(id=s47458, lexunits=qualitative Beziehung), Synset(id=s18227, lexunits=Sportveranstaltung, Sportereignis), Synset(id=s46926, lexunits=Beziehung, Verhältnis, Relation), Synset(id=s13222, lexunits=Zustand), Synset(id=s51001, lexunits=GNROOT), Synset(id=s18348, lexunits=Spiel, Match, Partie, Sportspiel), Synset(id=s21606, lexunits=Ballsportart, Ballspiel, Ballsport), Synset(id=s16438, lexunits=Ereignis), Synset(id=s21440, lexunits=Sport, Sportart, Disziplin, Sportdisziplin), Synset(id=s16437, lexunits=Situation)}
The synset has a depth of 8 
 is it the root node? False  
 is it a leaf node? False
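
To make the terms used above concrete, the following sketch computes the full hypernym set, the minimal depth and the leaf check on a tiny hand-written graph. It does not use the GermaNet API: the `HYPERNYMS` dictionary and all helper functions are hypothetical, the graph is a heavily simplified excerpt that only borrows a few names from the output above, and counting depth as the number of edges to the root is an assumption about how the library measures it.

```python
from collections import deque

# Toy hypernym graph: node -> set of direct hypernyms.
# Hypothetical, simplified data; not the real GermaNet structure.
HYPERNYMS = {
    "GNROOT": set(),
    "Ereignis": {"GNROOT"},
    "Sport": {"Ereignis"},
    "Spiel": {"Ereignis"},
    "Ballsport": {"Sport", "Spiel"},
    "Fußball": {"Ballsport"},
    "Hallenfußball": {"Fußball"},
}


def all_hypernyms(node):
    """Collect every node reachable by following hypernym edges upwards."""
    seen = set()
    stack = list(HYPERNYMS[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(HYPERNYMS[current])
    return seen


def min_depth(node):
    """Number of edges on the shortest hypernym chain from node to a root."""
    queue = deque([(node, 0)])
    visited = {node}
    while queue:
        current, depth = queue.popleft()
        if not HYPERNYMS[current]:  # no hypernyms: this is a root
            return depth
        for parent in HYPERNYMS[current]:
            if parent not in visited:
                visited.add(parent)
                queue.append((parent, depth + 1))
    raise ValueError("no root found above %r" % node)


def is_leaf(node):
    """A leaf is a node that no other node names as its hypernym."""
    return all(node not in parents for parents in HYPERNYMS.values())


print(all_hypernyms("Fußball"))
print(min_depth("Fußball"), is_leaf("Fußball"), is_leaf("Hallenfußball"))
```

A breadth-first search is used for the depth because a node can have more than one hypernym, so the first root reached is guaranteed to be the closest one.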


You can also use the API to compare a synset with another synset. These methods only work for two synsets of the 
same word category, for example two nouns. Let's say you would like to know how *Fußball* and *Tennis* are related within 
GermaNet. You first need to extract the synset for *Tennis*. Then you can check whether *Tennis* and *Fußball* share any 
hypernyms and print them. Finally, you can extract the shortest path between *Fußball* and *Tennis*, i.e. the shortest sequence of 
nodes you have to walk from *Fußball* to reach the synset of *Tennis*. You can also extract the distance between *Fußball* and 
*Tennis* (in this case the path length). Closely related synsets have a shorter distance than unrelated ones.


In [51]:
tennis_synsets = germanet.get_synsets_by_orthform("Tennis")
print(tennis_synsets)
tennis_synset = tennis_synsets[0]
print(fussball_synset.common_hypernyms(tennis_synset))
print(fussball_synset.shortest_path(tennis_synset))
print("Fußball and Tennis have a path distance of %d" % fussball_synset.shortest_path_distance(tennis_synset))


[Synset(id=s21613, lexunits=Tennissport, Tennis, Tennisspiel)]
{Synset(id=s18413, lexunits=Handlung, Tat, Aktivität, Tätigkeit), Synset(id=s73180, lexunits=Kampf, Wettkampf), Synset(id=s16557, lexunits=Geschehnis, Geschehen), Synset(id=s17614, lexunits=Veranstaltung), Synset(id=s47458, lexunits=qualitative Beziehung), Synset(id=s18275, lexunits=Sportwettkampf), Synset(id=s16438, lexunits=Ereignis), Synset(id=s18227, lexunits=Sportveranstaltung, Sportereignis), Synset(id=s46926, lexunits=Beziehung, Verhältnis, Relation), Synset(id=s51001, lexunits=GNROOT), Synset(id=s21606, lexunits=Ballsportart, Ballspiel, Ballsport), Synset(id=s13222, lexunits=Zustand), Synset(id=s18348, lexunits=Spiel, Match, Partie, Sportspiel), Synset(id=s20870, lexunits=Auseinandersetzung, Konflikt), Synset(id=s21440, lexunits=Sport, Sportart, Disziplin, Sportdisziplin), Synset(id=s16437, lexunits=Situation)}
[[Synset(id=s21624, lexunits=Fußballspiel, Fußball, Fußballsport), Synset(id=s21606, lexunits=Ballsportart
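
The idea behind common hypernyms and the path distance can be sketched on a toy graph. This is an illustration only, not the library's implementation: the graph, the node names *Wintersport* and *Skifahren*, and the helper functions are hypothetical, the sketch returns a single shortest path rather than a list of paths, and measuring the distance as the number of edges on that path is an assumption.

```python
from collections import deque

# Hypothetical toy graph: node -> direct hypernyms (not real GermaNet data).
HYPERNYMS = {
    "Sport": set(),
    "Ballsport": {"Sport"},
    "Wintersport": {"Sport"},
    "Fußball": {"Ballsport"},
    "Tennis": {"Ballsport"},
    "Skifahren": {"Wintersport"},
}


def ancestors(node):
    """All hypernyms of node, direct and indirect."""
    found = set()
    todo = list(HYPERNYMS[node])
    while todo:
        current = todo.pop()
        if current not in found:
            found.add(current)
            todo.extend(HYPERNYMS[current])
    return found


def common_hypernyms(a, b):
    """Hypernyms shared by both nodes."""
    return ancestors(a) & ancestors(b)


def shortest_path(a, b):
    """Breadth-first search over hypernym edges, walked in both directions."""
    neighbours = {node: set(parents) for node, parents in HYPERNYMS.items()}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            neighbours[parent].add(child)
    queue = deque([[a]])
    visited = {a}
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected


for other in ("Tennis", "Skifahren"):
    path = shortest_path("Fußball", other)
    print(other, common_hypernyms("Fußball", other), path, len(path) - 1)
```

In this toy graph *Fußball* and *Tennis* meet at their direct hypernym, while *Fußball* and *Skifahren* only meet at the root, which is why the second path is longer.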

### Inspect Lexical Units
Every synset contains one or several lexical units. They are always printed when you access the string representation of
a synset. The list of lexical units (lexunits) can be accessed for any synset. Let's inspect the lexical units for *Fußball*:
We have the lexunit *Fußballspiel*, the lexunit *Fußball* and the lexunit *Fußballsport*.

In [7]:
lexical_units_fussball = fussball_synset.lexunits
print(lexical_units_fussball)


[Lexunit(id=l11339, orthform=Fußball, synset_id=s7944)]


Every lexical unit has a number of orthographical forms. There are four different orthographical forms, but not every 
lexical unit has an entry for all of them:
* main orth. form
* orth. variation
* old orth. form
* old orth. variation

We can see that the lexunit for *Fußball* has only one orth. form, but that the related lexunit *Fußballklub* has the 
orthographical variation *Fußballclub*.

In [8]:
fussball_unit = germanet.get_lexunit_by_id("l29777")
orth_forms = fussball_unit.get_all_orthforms()
print(orth_forms)
fussballclub_unit = germanet.get_lexunit_by_id("l32423")
orth_forms = fussballclub_unit.get_all_orthforms()
print(orth_forms)
print(fussballclub_unit.orthvar)

{'Fußball'}
{'Fußballklub', 'Fußballclub'}
Fußballclub
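
The relationship between the four orthographical forms and the set returned above can be pictured with a small stand-in class. This is a sketch, not the library's Lexunit class: `ToyLexunit` is hypothetical, `orthform` and `orthvar` mirror names seen in the output above, and the field names `old_orthform` and `old_orthvar` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ToyLexunit:
    """Hypothetical stand-in for a lexical unit with four optional spellings."""
    orthform: str                       # main orthographical form
    orthvar: Optional[str] = None       # orthographical variation
    old_orthform: Optional[str] = None  # old orthographical form (assumed name)
    old_orthvar: Optional[str] = None   # old orthographical variation (assumed name)

    def get_all_orthforms(self) -> Set[str]:
        """Return the set of spellings that are actually filled in."""
        candidates = (self.orthform, self.orthvar,
                      self.old_orthform, self.old_orthvar)
        return {form for form in candidates if form is not None}


fussball = ToyLexunit("Fußball")
fussballklub = ToyLexunit("Fußballklub", orthvar="Fußballclub")
print(fussball.get_all_orthforms())
print(fussballklub.get_all_orthforms())
```

Returning a set means that the order of the spellings is not meaningful and that a spelling shared by two slots is reported only once.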


*Fußball* is a compound; compounds are very frequent in German. GermaNet stores information about the 
compound, for example that *Fuß* is the modifier and *Ball* is the head.


In [52]:
print(fussball_unit.compound_info)


CompoundInfo( modifier = Fuß, head = Ball)


Lexical units are related to other lexical units by different lexical relations. The most common and most general 
lexical relation is synonymy, but other relations are annotated as well. For example, for some compounds the relation
between the compound and its modifier has been annotated. In this example, the compound *Fußball* stands in the relation 
"has manner of functioning" to *Fuß*. Relations can be unidirectional (e.g. the relation "has manner of functioning" goes from *Fußball*
to *Fuß*, but not the other way around) or bidirectional (e.g. *Fußball* and *Fußballspiel* are 
synonyms of each other). If you are interested in finding out which unidirectional relations point towards *Fußball*, these
can be accessed via "incoming_relations":


In [53]:
print(fussball_unit.relations)
print(fussball_unit.incoming_relations)


defaultdict(<class 'set'>, {<LexRel.has_synonym: 'has_synonym'>: {Lexunit(id=l29778, orthform=Fußballsport, synset_id=s21624), Lexunit(id=l29776, orthform=Fußballspiel, synset_id=s21624)}, <LexRel.has_manner_of_functioning: 'has_manner_of_functioning'>: {Lexunit(id=l35740, orthform=Fuß, synset_id=s26149)}})
defaultdict(<class 'set'>, {<LexRel.has_pertainym: 'has_pertainym'>: {Lexunit(id=l4226, orthform=fußballerisch, synset_id=s2869)}, <LexRel.has_specialization: 'has_specialization'>: {Lexunit(id=l53360, orthform=Fußballamateur, synset_id=s37146)}, <LexRel.has_active_usage: 'has_active_usage'>: {Lexunit(id=l10294, orthform=Fußballschuh, synset_id=s7143), Lexunit(id=l13796, orthform=Fußballstadion, synset_id=s9891)}, <LexRel.has_topic: 'has_topic'>: {Lexunit(id=l88379, orthform=Fußballschule, synset_id=s63191)}})
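
One way to picture the difference between the two fields is to build both indexes from a list of directed edges. The sketch below is hypothetical throughout (the edge list, `BIDIRECTIONAL` and `build_indexes` are not part of the API). It assumes, based on the output above, that bidirectional relations such as synonymy are stored on both units as ordinary relations and that only unidirectional relations are recorded as incoming.

```python
from collections import defaultdict

# Relations treated as symmetric in this sketch (an assumption).
BIDIRECTIONAL = {"has_synonym"}

# Hypothetical edge list: (source, relation, target).
EDGES = [
    ("Fußball", "has_synonym", "Fußballspiel"),
    ("Fußball", "has_manner_of_functioning", "Fuß"),
    ("Fußballschuh", "has_active_usage", "Fußball"),
    ("Fußballschule", "has_topic", "Fußball"),
]


def build_indexes(edges):
    """Return (relations, incoming): unit -> relation -> set of units."""
    relations = defaultdict(lambda: defaultdict(set))
    incoming = defaultdict(lambda: defaultdict(set))
    for source, relation, target in edges:
        relations[source][relation].add(target)
        if relation in BIDIRECTIONAL:
            # symmetric relation: store it on the other unit as well
            relations[target][relation].add(source)
        else:
            # one-way relation: remember who points at the target
            incoming[target][relation].add(source)
    return relations, incoming


relations, incoming = build_indexes(EDGES)
print(dict(relations["Fußball"]))
print(dict(incoming["Fußball"]))
```

With this layout, looking up which units point at *Fußball* is a single dictionary access instead of a scan over all edges.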


Some lexical units have sense definitions, harvested from the German Wiktionary. These can be accessed with the wiktionary_paraphrases field.

In [54]:
print(fussball_unit.wiktionary_paraphrases)

[Wictionary(LexUnit ID=l29777, definition=Sport, Freizeit, kein Plural: eine beliebte Mannschaftssportart, welche mit 22 Spielern und einem Ball gespielt wird)]


Some lexical units have also been linked to the English WordNet. These links can be accessed with the ili_records field. 

In [55]:
print(fussball_unit.ili_records)


[IliRecord(LexUnit ID=l29777, relation=synonym, english_equivalent=association football)]


Lexical units that are verbs provide information on language use by giving at least one example sentence.
They are also annotated with subcategorisation patterns / verb complementations (frames). It is possible to extract
verbs with specific complements of interest: for example, if you are interested in all verbs that allow accusative complements,
you can extract them with dedicated methods defined in the Frames class. 


In [56]:
schiessen = germanet.get_lexunit_by_id("l80272")
print(schiessen)
print(schiessen.examples)
print(schiessen.frames)
f = frames.Frames(germanet.frames2lexunits)
all_verbs_with_accusative_complement = f.extract_accusative_complemtent()
print("There are  %d verbs that can take an accusative complement in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(all_verbs_with_accusative_complement), all_verbs_with_accusative_complement.pop(), all_verbs_with_accusative_complement.pop()))


Lexunit(id=l80272, orthform=schießen, synset_id=s56962)
['Er hatte den Ball ins Tor geschossen.']
['NN.AN.BD']
There are  11090 verbs that can take an accusative complement in GermaNet 
 An example of such is: Lexunit(id=l141965, orthform=durchkosten, synset_id=s106827) 
 Another example is : Lexunit(id=l79571, orthform=verweisen, synset_id=s56451)
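
The lookup by complement type can be pictured as an index from frame strings to verbs. The sketch below is not the Frames class: it assumes that a frame such as 'NN.AN.BD' is a dot-separated list of complement codes and that 'AN' marks an accusative noun-phrase complement. The frames given for *schlafen* and *geben* are invented for illustration, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Verb -> frame strings. Only the frame of "schießen" is taken from the
# output above; the other two entries are invented for illustration.
VERB_FRAMES = {
    "schießen": ["NN.AN.BD"],
    "schlafen": ["NN"],
    "geben": ["NN.AN.DN"],
}


def index_by_frame(verb_frames):
    """Invert the annotation: frame string -> set of verbs using it."""
    index = defaultdict(set)
    for verb, frames in verb_frames.items():
        for frame in frames:
            index[frame].add(verb)
    return index


def verbs_with_complement(frame_index, code):
    """Verbs that have at least one frame containing the given code."""
    return {
        verb
        for frame, verbs in frame_index.items()
        if code in frame.split(".")  # compare whole codes, not substrings
        for verb in verbs
    }


frame_index = index_by_frame(VERB_FRAMES)
print(verbs_with_complement(frame_index, "AN"))
```

Splitting the frame on the dot before testing avoids accidental matches where a code merely appears as a substring of a longer one.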


## How to extract a large number of examples by applying a filter function
If you would like to extract several lexical units or synsets from GermaNet that fulfill certain 
conditions, you can create a filter configuration. A filter configuration allows you, for example, to search for words of specific
word classes (e.g. you might be interested in extracting all abstract nouns) or to extract all words that 
contain a specific subword. To run a search, you have to create a filter configuration object and pass a search string
as an argument. All other options have defaults, but you can set them to adapt your search. 


In [57]:
# we can search for "schießen" without caring about upper or lower case or about different orthforms:
filterconfig = Filterconfig("schießen", ignore_case=True)
result = filterconfig.filter_synsets(germanet)
print("filtered result")
for word in result:
    print(word)
# Let's say we are only interested in synsets of a specific semantic class:
filterconfig.word_classes = [WordClass.Konkurrenz]
result = filterconfig.filter_synsets(germanet)
print("\nfiltered result")
for word in result:
    print(word, word.word_class)
# if we now additionally filter by word category and keep only nouns, the result is empty: the word class filter is still active, and no noun synset for 'schießen' belongs to WordClass.Konkurrenz:
filterconfig.word_categories = [WordCategory.nomen]
result = filterconfig.filter_synsets(germanet)
print("\nfiltered result")
print(result)

filtered result
Synset(id=s56664, lexunits=schießen)
Synset(id=s123485, lexunits=schießen)
Synset(id=s21555, lexunits=Schießsport, Sportschießen, Schießen)
Synset(id=s59153, lexunits=knipsen, schießen)
Synset(id=s57998, lexunits=stürmen, stürzen, schießen)
Synset(id=s60205, lexunits=erlegen, schießen)
Synset(id=s56650, lexunits=schießen)
Synset(id=s56962, lexunits=schießen)

filtered result
Synset(id=s56650, lexunits=schießen) WordClass.Konkurrenz
Synset(id=s56664, lexunits=schießen) WordClass.Konkurrenz

filtered result
set()
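
The last result is empty because the conditions accumulate on the same configuration object. The following sketch reproduces that behaviour on invented data. It is not the Filterconfig implementation: `ToySynset`, `SYNSETS` and `filter_synsets` are hypothetical, and apart from the Konkurrenz verb the word classes assigned here are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySynset:
    lemma: str
    word_category: str
    word_class: str


# Invented entries; only "schießen" as a Konkurrenz verb mirrors the output above.
SYNSETS = [
    ToySynset("schießen", "verben", "Konkurrenz"),
    ToySynset("schießen", "verben", "Lokation"),
    ToySynset("Schießen", "nomen", "Geschehen"),
    ToySynset("Fußball", "nomen", "Geschehen"),
]


def filter_synsets(synsets, search, ignore_case=False,
                   word_categories=None, word_classes=None):
    """Keep the synsets that satisfy every active condition (logical AND)."""
    def matches(synset):
        lemma, query = synset.lemma, search
        if ignore_case:
            lemma, query = lemma.lower(), query.lower()
        if lemma != query:
            return False
        if word_categories is not None and synset.word_category not in word_categories:
            return False
        if word_classes is not None and synset.word_class not in word_classes:
            return False
        return True

    return [synset for synset in synsets if matches(synset)]


print(filter_synsets(SYNSETS, "schießen", ignore_case=True))
print(filter_synsets(SYNSETS, "schießen", ignore_case=True,
                     word_classes=["Konkurrenz"]))
print(filter_synsets(SYNSETS, "schießen", ignore_case=True,
                     word_classes=["Konkurrenz"], word_categories=["nomen"]))
```

Dropping the word class condition again while keeping the noun condition would bring back the noun entry, which shows that the empty result comes from the combination of filters rather than from the noun filter alone.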


Besides using full words as search strings, we can use regular expressions. This can be very useful if you are interested 
in words with certain character sequences. The next examples show how to extract all words that end with "kuchen", all 
words that contain a whitespace or hyphen (useful, for example, to extract multiword expressions) and all verbs that contain
'ff' or 'ss':

In [58]:
# extract all words that end with 'kuchen'
filterconfig = Filterconfig('.*kuchen', regex=True)
filterconfig.word_categories = [WordCategory.nomen]
result = filterconfig.filter_lexunits(germanet)
print("Found  %d words that end with 'kuchen' in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(result), result.pop(), result.pop()))

# extract all words that contain a white space or a hyphen
filterconfig = Filterconfig(r'.+(\s|-).+', regex=True)
filterconfig.word_categories = [WordCategory.nomen]
result = filterconfig.filter_lexunits(germanet)
print("\nFound  %d multiword expressions with whitespace or hypen in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(result), result.pop(), result.pop()))

# extract all verbs that contain 'ss' or 'ff' (two or more in a row) inside the word, i.e. not at its very start or end
filterconfig = Filterconfig('.+(f{2,}|s{2,}).+', regex=True)
filterconfig.word_categories = [WordCategory.verben]
result = filterconfig.filter_lexunits(germanet)
print("\nFound  %d verbs with double s or double f in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(result), result.pop(), result.pop()))


Found  60 words that end with 'kuchen' in GermaNet 
 An example of such is: Lexunit(id=l117636, orthform=Filterkuchen, synset_id=s87408) 
 Another example is : Lexunit(id=l57634, orthform=Nusskuchen, synset_id=s39235)

Found  5183 multiword expressions with whitespace or hypen in GermaNet 
 An example of such is: Lexunit(id=l92787, orthform=Belize City, synset_id=s43847) 
 Another example is : Lexunit(id=l44251, orthform=plattdeutscher Dialekt, synset_id=s32155)
{Lexunit(id=l161058, orthform=vorlassen, synset_id=s122408), Lexunit(id=l78492, orthform=messen, synset_id=s55626), Lexunit(id=l85818, orthform=auffressen, synset_id=s61091), Lexunit(id=l85132, orthform=herauslassen, synset_id=s60572), Lexunit(id=l77089, orthform=abschlaffen, synset_id=s54602), Lexunit(id=l160720, orthform=hinschaffen, synset_id=s122122), Lexunit(id=l154941, orthform=abkassieren, synset_id=s117379), Lexunit(id=l78269, orthform=missbilligen, synset_id=s55465), Lexunit(id=l84415, orthform=passivieren, synset_id=s
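
The three patterns can be tried out on a plain word list with Python's `re` module before running them against GermaNet. The sketch assumes that a pattern has to match the whole word form (`re.fullmatch`); whether Filterconfig matches in exactly this way is an assumption. Most words are taken from the outputs above, while *Kuchenform*, *Hals-Nasen-Ohren-Arzt* and *Fluss* are added as extra test cases, and `full_matches` is a hypothetical helper.

```python
import re

WORDS = [
    "Nusskuchen", "Filterkuchen", "Kuchenform",
    "Belize City", "plattdeutscher Dialekt", "Hals-Nasen-Ohren-Arzt",
    "messen", "auffressen", "schießen", "Fluss",
]


def full_matches(pattern, words):
    """Return the words that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]


# words that end with 'kuchen'
print(full_matches(r".*kuchen", WORDS))

# words that contain a whitespace character or a hyphen
print(full_matches(r".+(\s|-).+", WORDS))

# words with 'ff' or 'ss' (two or more) that is neither word-initial nor word-final
print(full_matches(r".+(f{2,}|s{2,}).+", WORDS))
```

Note that *Fluss* is rejected by the last pattern because the trailing `.+` requires at least one character after the doubled letter, and *schießen* is rejected because 'ß' is not the same as 'ss'. The patterns are written as raw strings so that `\s` reaches the regex engine unchanged.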