# PyGermanet Tutorial

The following tutorial shows some examples of how to use the Python API for GermaNet. GermaNet is a lexical-semantic 
net that relates German nouns, verbs and adjectives semantically by grouping lexical units that express
the same concept into synsets. 

With the Python API we can extract synsets and lexical units for a given word and inspect different properties and related
synsets / lexunits. To use the API you can install it with pip:



In [4]:
import sys
!{sys.executable} -m pip install -U germanetpy

Collecting germanetpy
  Downloading https://files.pythonhosted.org/packages/12/16/d17862422eae401706fc31b0f5eb3d45d18bef211a3b284167c18cd24935/germanetpy-0.2.1-py3-none-any.whl (54kB)
     |████████████████████████████████| 61kB 1.1MB/s eta 0:00:01
Installing collected packages: germanetpy
  Found existing installation: germanetpy 0.2.0
    Uninstalling germanetpy-0.2.0:
      Successfully uninstalled germanetpy-0.2.0
Successfully installed germanetpy-0.2.1


Whenever you want to use the API, the first thing you do is to create a `Germanet` object, which loads the data and provides access to it. The data (all XML files) have to be stored in one directory, whose path has to be specified as the first argument when you construct the GermaNet object. 

If you want to run this code, put your XML files in the directory used below (`germanet/GN_V150/GN_V150_XML` in your home directory) or change the path to the location on your computer. The API also provides methods to compute semantic similarity / relatedness between words (synsets). To use all of them, you have to provide frequency lists for each word category. These lists are available for download; the example below expects the noun list in `germanet/GN_V150/FreqLists`.
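
Before loading, it can be useful to confirm that the data directory really contains XML files. The helper below is a minimal sketch and not part of germanetpy: the name `find_xml_files` is hypothetical, and it only assumes that the XML files sit directly inside the given directory, as described above.

```python
from pathlib import Path


def find_xml_files(data_dir):
    """Return the sorted XML files directly inside data_dir.

    Raises FileNotFoundError if the directory is missing or holds no XML files.
    (Hypothetical helper, not part of germanetpy.)
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Not a directory: {data_dir}")
    xml_files = sorted(data_dir.glob("*.xml"))
    if not xml_files:
        raise FileNotFoundError(f"No XML files found in: {data_dir}")
    return xml_files
```

Calling this helper with the path you are about to pass to `Germanet` turns a wrong path into an immediate, readable error.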

In [6]:
from pathlib import Path
from germanetpy.germanet import Germanet

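# build the paths with pathlib so that the separators are correct on every platform
data_path = str(Path.home() / "germanet" / "GN_V150" / "GN_V150_XML")
frequencylist_nouns = str(Path.home() / "germanet" / "GN_V150" / "FreqLists" / "noun_freqs_decow14_16.txt")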
germanet = Germanet(data_path)

Load GermaNet data...: 100%|█████████▉| 99.99999999999996/100 [00:07<00:00, 13.21it/s] 
Load Wiktionary data...: 100%|██████████| 100.0/100 [00:00<00:00, 269.70it/s]             
Load Ili records...: 100%|██████████| 100.0/100 [00:00<00:00, 294.76it/s]


The data has been loaded and we can now use the API to extract specific information from the data.

## How to inspect synset information for a single word input

Let's consider the input word *Fußball* 'football'. The following shows how to extract all synsets given an input word. Many words are ambiguous; *Fußball* belongs to two synsets. 

In [7]:
fussball_synsets = germanet.get_synsets_by_orthform("Fußball")
# the length of the retrieved list is equal to the number of possible senses for a word, in this case 2
len(fussball_synsets)

2

The string representations include the lexical units, which can be helpful when you want to select
a specific meaning for a given word.

In [8]:
for synset in fussball_synsets:
    print(synset)

Synset(id=s7944, lexunits=Fußball)
Synset(id=s21624, lexunits=Fußballspiel, Fußball, Fußballsport)


In this case, let's say we are interested in the sense of *Fußball* which is synonymous with *Fußballspiel*, that is, the game rather than the ball.

In [9]:
fussball_synset = germanet.get_synset_by_id('s21624')

Every synset has a number of properties that can be extracted. Each synset has a unique id, which is the character
's' followed by a number.

In [10]:
fussball_synset.id

's21624'

A synset can have one of three possible word categories (verb, noun, adjective). 

In [11]:
fussball_synset.word_category

<WordCategory.nomen: 2>

For each of the word categories, the semantic space is divided into a number of semantic fields (e.g. *Besitz*,
*Kommunikation*, *Geschehen*, ...). The semantic field of a synset is available as its `word_class`. 


In [12]:
fussball_synset.word_class

<WordClass.Geschehen: 21>

Synsets are related to other synsets via conceptual relations. The most important relation is the hypernymy
/ hyponymy relation. Direct hypernyms of a synset (one level above) and hyponyms (one level below) can be accessed through separate fields:

In [13]:
fussball_synset.direct_hypernyms

{Synset(id=s21606, lexunits=Ballspiel, Ballsport, Ballsportart)}

In [14]:
fussball_synset.direct_hyponyms

{Synset(id=s104374, lexunits=Spitzenfußball),
 Synset(id=s124051, lexunits=Freizeitfußball),
 Synset(id=s133731, lexunits=Weltfußball),
 Synset(id=s133871, lexunits=Mädchenfußball),
 Synset(id=s137475, lexunits=Klubfußball),
 Synset(id=s137940, lexunits=Berufsfußball),
 Synset(id=s139802, lexunits=Torwandschießen),
 Synset(id=s145820, lexunits=Kombinationsfußball),
 Synset(id=s146603, lexunits=Straßenfußball),
 Synset(id=s21625, lexunits=Profifußball),
 Synset(id=s21626, lexunits=Frauenfußball, Damenfußball),
 Synset(id=s62081, lexunits=Vereinsfußball),
 Synset(id=s69685, lexunits=Amateurfußball),
 Synset(id=s71210, lexunits=Jugendfußball),
 Synset(id=s79925, lexunits=Hallenfußball),
 Synset(id=s84590, lexunits=Männerfußball)}

All conceptually related synsets are stored in the `relations` field:


In [15]:
for relation, synsets in fussball_synset.relations.items():
    print("\nRelation: %s" % relation)
    print(synsets)


Relation: ConRel.has_hypernym
{Synset(id=s21606, lexunits=Ballspiel, Ballsport, Ballsportart)}

Relation: ConRel.has_hyponym
{Synset(id=s69685, lexunits=Amateurfußball), Synset(id=s104374, lexunits=Spitzenfußball), Synset(id=s137475, lexunits=Klubfußball), Synset(id=s133871, lexunits=Mädchenfußball), Synset(id=s79925, lexunits=Hallenfußball), Synset(id=s145820, lexunits=Kombinationsfußball), Synset(id=s21626, lexunits=Frauenfußball, Damenfußball), Synset(id=s133731, lexunits=Weltfußball), Synset(id=s71210, lexunits=Jugendfußball), Synset(id=s21625, lexunits=Profifußball), Synset(id=s137940, lexunits=Berufsfußball), Synset(id=s146603, lexunits=Straßenfußball), Synset(id=s62081, lexunits=Vereinsfußball), Synset(id=s124051, lexunits=Freizeitfußball), Synset(id=s84590, lexunits=Männerfußball), Synset(id=s139802, lexunits=Torwandschießen)}

Relation: ConRel.is_related_to
{Synset(id=s20943, lexunits=Handspiel), Synset(id=s15474, lexunits=Ablöse), Synset(id=s18513, lexunits=Länderspieleinsat

We can see that *Fußball* has exactly one hypernym and several hyponyms. It is also possible to list all hypernyms, direct and transitive,
on the paths from *Fußball* to the top node (root node).

In [16]:
fussball_synset.all_hypernyms()

{Synset(id=s13222, lexunits=Zustand),
 Synset(id=s16437, lexunits=Situation),
 Synset(id=s16438, lexunits=Ereignis),
 Synset(id=s16557, lexunits=Geschehen, Geschehnis),
 Synset(id=s17614, lexunits=Veranstaltung),
 Synset(id=s18227, lexunits=Sportveranstaltung, Sportereignis),
 Synset(id=s18275, lexunits=Sportwettkampf),
 Synset(id=s18348, lexunits=Spiel, Match, Partie, Sportspiel),
 Synset(id=s18413, lexunits=Handlung, Tat, Aktivität, Tätigkeit),
 Synset(id=s20870, lexunits=Auseinandersetzung, Konflikt),
 Synset(id=s21440, lexunits=Sport, Sportart, Disziplin, Sportdisziplin),
 Synset(id=s21606, lexunits=Ballspiel, Ballsport, Ballsportart),
 Synset(id=s46926, lexunits=Beziehung, Verhältnis, Relation),
 Synset(id=s47458, lexunits=qualitative Beziehung),
 Synset(id=s51001, lexunits=GNROOT),
 Synset(id=s73180, lexunits=Kampf, Wettkampf)}

The level at which the *Fußball* synset is attached to the graph is called its depth. A synset can have more than one hypernym and can therefore be reached from the root along paths of different lengths, so it can have several depths; `min_depth()` returns the smallest of them.


In [17]:
fussball_synset.min_depth()

8
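
To see how several hypernyms lead to several depths, consider the toy graph below. It is a minimal sketch on hypothetical data, not GermaNet data and not the library's implementation; the function names are made up for illustration, and it assumes that the root itself has depth 0.

```python
# Toy hypernym graph (hypothetical data, not GermaNet): node -> direct hypernyms.
TOY_HYPERNYMS = {
    "Fußball": ["Ballspiel"],
    "Ballspiel": ["Sport", "Spiel"],
    "Sport": ["ROOT"],
    "Spiel": ["Veranstaltung"],
    "Veranstaltung": ["ROOT"],
    "ROOT": [],
}


def all_hypernyms(node):
    """Collect every direct and transitive hypernym of node."""
    found = set()
    for parent in TOY_HYPERNYMS[node]:
        found.add(parent)
        found |= all_hypernyms(parent)
    return found


def depths(node):
    """Return the length of every hypernym path from node up to ROOT."""
    if not TOY_HYPERNYMS[node]:
        return [0]  # assumption: the root itself has depth 0
    return [d + 1 for parent in TOY_HYPERNYMS[node] for d in depths(parent)]


print(sorted(all_hypernyms("Fußball")))  # ['Ballspiel', 'ROOT', 'Spiel', 'Sport', 'Veranstaltung']
print(sorted(depths("Fußball")))         # [3, 4]
print(min(depths("Fußball")))            # 3
```

Because *Ballspiel* has two hypernyms in this toy graph, *Fußball* reaches the root along two paths of different length, and the minimum of the two is what a "minimal depth" would report.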

We can also check whether *Fußball* is the root or a leaf of the GermaNet graph (although of course
we already know that this is not the case, as it has both hypernyms and hyponyms).

In [18]:
fussball_synset.is_root()

False

In [19]:
fussball_synset.is_leaf()

False

## Use the semantic utils to measure semantic similarity / relatedness

You can also use the API to compare a synset with another synset. These methods work only for two synsets that have the same word category, for example for two nouns. There are two different types of similarity measures:
- path-based measures
- information-content-based measures

Path-based measures compute the semantic relatedness between two concepts based on the shortest path between two synsets in the hypernym relation. The shortest path length is the minimal number of edges (hypernym links) on a path between the two synsets in the relation. Different measures weight or normalize the path length in different ways.

First we will look at the simple path distance between two synsets.

Let's say you would like to know how *Fußball* and *Tennis* are related within 
GermaNet. You first need to extract the synset for *Tennis*. Then you can check whether *Tennis* and *Fußball* share any hypernyms.

In [20]:
tennis_synsets = germanet.get_synsets_by_orthform("Tennis")
tennis_synset = tennis_synsets[0]
fussball_synset.common_hypernyms(tennis_synset)

{Synset(id=s13222, lexunits=Zustand),
 Synset(id=s16437, lexunits=Situation),
 Synset(id=s16438, lexunits=Ereignis),
 Synset(id=s16557, lexunits=Geschehen, Geschehnis),
 Synset(id=s17614, lexunits=Veranstaltung),
 Synset(id=s18227, lexunits=Sportveranstaltung, Sportereignis),
 Synset(id=s18275, lexunits=Sportwettkampf),
 Synset(id=s18348, lexunits=Spiel, Match, Partie, Sportspiel),
 Synset(id=s18413, lexunits=Handlung, Tat, Aktivität, Tätigkeit),
 Synset(id=s20870, lexunits=Auseinandersetzung, Konflikt),
 Synset(id=s21440, lexunits=Sport, Sportart, Disziplin, Sportdisziplin),
 Synset(id=s21606, lexunits=Ballspiel, Ballsport, Ballsportart),
 Synset(id=s46926, lexunits=Beziehung, Verhältnis, Relation),
 Synset(id=s47458, lexunits=qualitative Beziehung),
 Synset(id=s51001, lexunits=GNROOT),
 Synset(id=s73180, lexunits=Kampf, Wettkampf)}

You can then extract the shortest path you can walk from *Fußball* to end up at *Tennis*. 

In [21]:
fussball_synset.shortest_path(tennis_synset)

[[Synset(id=s21624, lexunits=Fußballspiel, Fußball, Fußballsport),
  Synset(id=s21606, lexunits=Ballspiel, Ballsport, Ballsportart),
  Synset(id=s21613, lexunits=Tennis, Tennisspiel, Tennissport)]]

You can also extract the distance between *Fußball* and *Tennis* (in this case the path length). Synsets that are more similar will have a shorter distance than unrelated synsets.

In [22]:
fussball_synset.shortest_path_distance(tennis_synset)

2
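
The idea behind the shortest path can be sketched as a breadth-first search that may walk hypernym links in both directions. The graph below is hypothetical toy data and `shortest_path` here is an illustrative function, not the germanetpy method; unlike the library call above, which returns a list of paths, it returns a single path.

```python
from collections import deque

# Toy hypernym graph (hypothetical data, not GermaNet): node -> direct hypernyms.
TOY_HYPERNYMS = {
    "Fußball": ["Ballspiel"],
    "Tennis": ["Ballspiel"],
    "Ballspiel": ["Sport"],
    "Schach": ["Denksport"],
    "Denksport": ["Sport"],
    "Sport": [],
}


def shortest_path(start, goal, hypernyms=TOY_HYPERNYMS):
    """Breadth-first search over hypernym links, walked in both directions."""
    # Turn the directed child -> parent links into undirected neighbour sets.
    neighbours = {node: set(parents) for node, parents in hypernyms.items()}
    for node, parents in hypernyms.items():
        for parent in parents:
            neighbours[parent].add(node)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected


path = shortest_path("Fußball", "Tennis")
print(path)           # ['Fußball', 'Ballspiel', 'Tennis']
print(len(path) - 1)  # 2 (number of edges on the path)
```

The distance is the number of edges, which is why a path through three nodes has distance 2.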

### Example for path-based measures

The following example shows how to use path-based semantic relatedness measures to check whether *Trompete* (trumpet) is more closely related to *Posaune* (trombone) than to *Flöte* (flute) and how to disambiguate *Flügel* (wing, blade, grand) in the context of *Klavier* (piano). 

To use the path-based semantic relatedness measures you have to initialize a `PathBasedRelatedness` object. Its constructor takes the longest possible shortest-path distance between two synsets (`max_len`), a maximum depth (`max_depth`) and a synset pair that is maximally far apart (`synset_pair`) as arguments. If the synset pair is not given, it will be computed on the fly, but the computation might take some time, especially for nouns.

As mentioned above, these measures only work for synsets that belong to the same word category, which has to be specified in the constructor.

In [23]:
from germanetpy.path_based_relatedness_measures import PathBasedRelatedness
from germanetpy.synset import WordCategory

# First, construct a path-based similarity object. 
# The johannis_wurm and leber_trans synsets are maximally far apart among nouns:
johannis_wurm = germanet.get_synset_by_id("s49774")
leber_trans = germanet.get_synset_by_id("s83979")
relatedness_calculator = PathBasedRelatedness(germanet=germanet, category=WordCategory.nomen, max_len=35,
                                              max_depth=20, synset_pair=(johannis_wurm, leber_trans))

We can now use the `relatedness_calculator` object to find out whether *Trompete* (trumpet) is more closely related to *Posaune* (trombone) or to *Flöte* (flute):

In [24]:
trompete = germanet.get_synsets_by_orthform("Trompete").pop()
flöte = germanet.get_synsets_by_orthform("Flöte").pop()
posaune = germanet.get_synsets_by_orthform("Posaune").pop()
trompete_posaune = relatedness_calculator.simple_path(trompete, posaune)
trompete_flöte = relatedness_calculator.simple_path(trompete, flöte)

trompete_posaune > trompete_flöte 

True

Path-based relatedness measures can also be used to disambiguate word senses. This example shows how to find the sense of *Flügel* (wing, blade, grand) which is most similar to *Klavier* (piano) according to three different path-based measures:

In [25]:
Klavier = germanet.get_synsets_by_orthform("Klavier").pop()
Flügel_synsets = germanet.get_synsets_by_orthform("Flügel")

results = []
highest_sim_simple = 0.0
highest_sim_leacock = 0.0
highest_sim_wu = 0.0
most_similar_synset = None

for synset in Flügel_synsets:
    if synset.word_category == WordCategory.nomen:
        sim_simple = relatedness_calculator.simple_path(synset, Klavier, normalize=True)
        sim_leacock = relatedness_calculator.leacock_chodorow(synset, Klavier, normalize=True)
        sim_wu = relatedness_calculator.wu_and_palmer(synset, Klavier, normalize=True)
        results.append([
            synset.id, 
            ", ".join(lu.orthform for lu in synset.lexunits),
            sim_simple, 
            sim_leacock, 
            sim_wu
        ])
        
        if sim_simple > highest_sim_simple and sim_leacock > highest_sim_leacock and sim_wu > highest_sim_wu :
            highest_sim_simple = sim_simple
            highest_sim_leacock = sim_leacock
            highest_sim_wu = sim_wu
            most_similar_synset = synset

most_similar_synset

Synset(id=s11625, lexunits=Flügel)

We can verify that this is the most similar synset by looking at all the similarity results in a table:

In [26]:
from IPython.display import Markdown as md

results_header = ("| Synset ID | Lex Units | Simple Path | Leacock and Chodorow | Wu and Palmer |\n" +
                  "|-----------|-----------|-------------|----------------------|---------------|\n")

results_table = results_header + "".join(["|{}|{}|{}|{}|{}|\n".format(*result) for result in results])
md(results_table)

| Synset ID | Lex Units | Simple Path | Leacock and Chodorow | Wu and Palmer |
|-----------|-----------|-------------|----------------------|---------------|
|s26446|Flügel, Schwinge|0.68571|0.30657|0.26667|
|s73683|Flügel|0.68571|0.30657|0.35294|
|s11625|Flügel|0.94286|0.69343|0.88889|
|s12102|Flügel, Rotorblatt|0.8|0.41972|0.53333|
|s9697|Flügel, Seitenflügel|0.74286|0.35745|0.4|
|s73727|Flügel|0.77143|0.38685|0.42857|
|s23151|Parteiflügel, Flügel|0.65714|0.28424|0.33333|
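
The disambiguation loop only updates its winner when all three scores improve at the same time, which works here because the measures agree. A slightly more defensive variant, sketched below, ranks the candidates separately for each measure so that disagreements would become visible. The scores are copied from the table above; the variable names are illustrative.

```python
# Scores copied from the results table:
# (synset id, simple path, Leacock and Chodorow, Wu and Palmer)
rows = [
    ("s26446", 0.68571, 0.30657, 0.26667),
    ("s73683", 0.68571, 0.30657, 0.35294),
    ("s11625", 0.94286, 0.69343, 0.88889),
    ("s12102", 0.8, 0.41972, 0.53333),
    ("s9697", 0.74286, 0.35745, 0.4),
    ("s73727", 0.77143, 0.38685, 0.42857),
    ("s23151", 0.65714, 0.28424, 0.33333),
]

# Column index of each measure inside a row.
MEASURES = {"simple_path": 1, "leacock_chodorow": 2, "wu_and_palmer": 3}

# For every measure, pick the synset id with the highest score.
best_per_measure = {
    name: max(rows, key=lambda row: row[column])[0]
    for name, column in MEASURES.items()
}
print(best_per_measure)
```

All three measures select `s11625`, the same synset the loop returned.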


### Example for IC-based measures

One problem with path-based measures is that paths of the same length in the hypernym relation can correspond to very different intuitive semantic "distances".
Measures based on *information content* (IC) seek to solve this problem by augmenting information about the structural distances in the hypernym relation with information about word frequencies. 
The word frequencies are used to compute the information content, which grades concepts from more specific to more general. If a very specific synset is compared to a very general one, the relatedness will be low. The relatedness of two synsets is measured based on the information content of their least common subsumer (the lowest synset in the hierarchy that is a hypernym of both synsets).
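
The idea can be sketched on a toy hierarchy. The hierarchy, the frequency counts and all function names below are hypothetical, and the code follows the textbook definition of information content and of Resnik's measure (the IC of the least common subsumer) rather than the germanetpy implementation, whose smoothing and normalization may differ.

```python
import math

# Toy hierarchy and raw corpus counts (hypothetical, not GermaNet data).
PARENT = {
    "Lebewesen": None,
    "Pflanze": "Lebewesen",
    "Tier": "Lebewesen",
    "Eiche": "Pflanze",
    "Roteiche": "Eiche",
    "Steineiche": "Eiche",
}
RAW_FREQ = {"Lebewesen": 10, "Pflanze": 40, "Tier": 45,
            "Eiche": 3, "Roteiche": 1, "Steineiche": 1}


def ancestors(node):
    """Path from node up to the root, with node itself first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def cumulative_freq(node):
    """Raw frequency of node plus the frequencies of everything below it."""
    return sum(freq for name, freq in RAW_FREQ.items() if node in ancestors(name))


TOTAL = cumulative_freq("Lebewesen")  # 100 in this toy example


def information_content(node):
    """General concepts (high cumulative frequency) get a low value."""
    return math.log(TOTAL / cumulative_freq(node))


def least_common_subsumer(a, b):
    """Lowest node that lies on the root paths of both a and b."""
    others = set(ancestors(b))
    return next(node for node in ancestors(a) if node in others)


def resnik(a, b):
    return information_content(least_common_subsumer(a, b))


print(resnik("Pflanze", "Tier"))         # 0.0
print(resnik("Roteiche", "Steineiche"))  # log(100 / 5), roughly 3.0
```

In this toy setup the two general concepts only share the root, which carries no information, while the two oak species share the much more specific *Eiche*.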

To use these measures, you have to create an `ICBasedSimilarity` object that takes frequency lists as an additional argument. These lists contain the raw frequencies of the nouns, adjectives and verbs that are in GermaNet, based on a very large corpus. You can use either the provided frequency lists or your own lists.

In [27]:
from germanetpy.icbased_similarity import ICBasedSimilarity

relatedness_nouns = ICBasedSimilarity(germanet=germanet, 
                                      wordcategory=WordCategory.nomen,
                                      path=frequencylist_nouns)

The following code snippet shows the advantage of the IC-based measures. While path-based measures would classify the words *Pflanze* 'plant' and *Tier* 'animal' as being almost as similar as the words *Roteiche* 'red oak' and *Steineiche* 'holm oak', the IC-based measures distinguish whether two synsets are very general or more specific and consequently assign a higher similarity score to the second pair of words.

In [28]:
# first word pair:
pflanze = germanet.get_synset_by_id("s44960")
tier = germanet.get_synset_by_id("s48805")

# second word pair:
roteiche = germanet.get_synset_by_id("s46054")
steineiche = germanet.get_synset_by_id("s46056")

Notice that a path-based measure between these word pairs yields almost the same results:

In [29]:
relatedness_calculator.leacock_chodorow(pflanze, tier, normalize=True)

0.61315

In [30]:
relatedness_calculator.leacock_chodorow(roteiche, steineiche, normalize=True)

0.69343
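
Why are the two scores so close? The normalized path-based scores printed in this tutorial can be reproduced from the path distance alone. The formulas below are inferred from the printed numbers together with `max_len=35`; they are an assumption about the normalization, not taken from the germanetpy source, and the function names are illustrative.

```python
import math

MAX_LEN = 35  # value passed as max_len when the PathBasedRelatedness object was created


def simple_path_normalized(distance, max_len=MAX_LEN):
    """Share of the maximal distance that is not used up by the path."""
    return (max_len - distance) / max_len


def leacock_chodorow_normalized(distance, max_len=MAX_LEN):
    """Log-scaled variant: 1 for distance 0, 0 for distance max_len."""
    return 1 - math.log(distance + 1) / math.log(max_len + 1)


for distance in (2, 3):
    print(distance,
          round(simple_path_normalized(distance), 5),
          round(leacock_chodorow_normalized(distance), 5))
# 2 0.94286 0.69343
# 3 0.91429 0.61315
```

Under this reading, *Roteiche* and *Steineiche* are two edges apart and *Pflanze* and *Tier* three, so a measure that only sees the path length cannot separate the pairs by much.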

But an IC-based measure clearly distinguishes the two pairs:

In [31]:
relatedness_nouns.resnik(pflanze, tier, normalize=True)

0.10985

In [32]:
relatedness_nouns.resnik(roteiche, steineiche, normalize=True)

0.51592

For a more convenient search through the ontology and the semantic relatedness computation, you can use the GermaNet web application "Rover":
https://weblicht.sfs.uni-tuebingen.de/rover/

## Inspect Lexical Units
Every synset contains one or several lexical units. The list of lexical units (`lexunits`) can be accessed for any synset. Let's inspect the lexical units for *Fußball* 'football':
we have the lexunit *Fußballspiel* 'football match', the lexunit *Fußball* 'football' and the lexunit *Fußballsport* 'soccer'.

In [33]:
fussball_synset.lexunits

[Lexunit(id=l29776, orthform=Fußballspiel, synset_id=s21624),
 Lexunit(id=l29777, orthform=Fußball, synset_id=s21624),
 Lexunit(id=l29778, orthform=Fußballsport, synset_id=s21624)]

Every lexical unit has a number of orthographical forms. There are four different orthographical forms but not every 
lexical unit has an entry for all of them:
* main orth. form
* orth. variation
* old orth. form
* old orth. variation

We can see that the lexunit for *Fußball* only has one orth form, but that the related lexunit *Fußballklub* 'football club' has the 
orthographical variation *Fußballclub*.

In [34]:
fussball_unit = germanet.get_lexunit_by_id("l29777")
fussball_unit.get_all_orthforms()

{'Fußball'}

In [35]:
fussballclub_unit = germanet.get_lexunit_by_id("l32423")
fussballclub_unit.get_all_orthforms()

{'Fußballclub', 'Fußballklub'}

In [36]:
fussballclub_unit.orthvar

'Fußballclub'

*Fußball* is a compound noun; compounds are very frequent in the German language. GermaNet stores information about the 
compound, for example that *Fuß* 'foot' is the modifier and *Ball* 'ball' is the head.


In [37]:
fussball_unit.compound_info

CompoundInfo( modifier = Fuß, head = Ball)

Lexical units are related to other lexical units by different lexical relations. The most common and most general 
lexical relation is synonymy (i.e., the relation which groups lexical units into synsets), but there are other lexical relations in GermaNet as well. For example, for some compounds there has been work
on annotating the relation between the compound and the modifier. In this example the compound *Fußball* has the manner of functioning *Fuß*. 

In [38]:
fussball_unit.relations

defaultdict(set,
            {<LexRel.has_synonym: 'has_synonym'>: {Lexunit(id=l29776, orthform=Fußballspiel, synset_id=s21624),
              Lexunit(id=l29778, orthform=Fußballsport, synset_id=s21624)},
             <LexRel.has_manner_of_functioning: 'has_manner_of_functioning'>: {Lexunit(id=l35740, orthform=Fuß, synset_id=s26149)}})

The relations can be unidirectional (e.g., the relation "has manner of functioning" goes from *Fußball*
to *Fuß*, but not the other way around). The relations can also be bidirectional (e.g., *Fußball* and *Fußballspiel* are synonyms of each other). If you are interested in finding out which unidirectional relations point towards *Fußball*, these can be accessed via `incoming_relations`:

In [39]:
fussball_unit.incoming_relations

defaultdict(set,
            {<LexRel.has_pertainym: 'has_pertainym'>: {Lexunit(id=l4226, orthform=fußballerisch, synset_id=s2869)},
             <LexRel.has_specialization: 'has_specialization'>: {Lexunit(id=l53360, orthform=Fußballamateur, synset_id=s37146)},
             <LexRel.has_active_usage: 'has_active_usage'>: {Lexunit(id=l10294, orthform=Fußballschuh, synset_id=s7143),
              Lexunit(id=l13796, orthform=Fußballstadion, synset_id=s9891)},
             <LexRel.has_topic: 'has_topic'>: {Lexunit(id=l88379, orthform=Fußballschule, synset_id=s63191)}})
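
One way to picture `incoming_relations` is as the inverse of the outgoing `relations` mapping. The sketch below illustrates that idea with plain strings in place of `Lexunit` objects; the edges are read off the outputs above, while the function name `invert_relations` and the data layout are hypothetical and do not reflect how germanetpy stores relations internally.

```python
from collections import defaultdict

# Outgoing relations, read off the outputs above (orthforms only):
# source -> relation name -> set of targets.
OUTGOING = {
    "Fußball": {"has_manner_of_functioning": {"Fuß"}},
    "Fußballschuh": {"has_active_usage": {"Fußball"}},
    "Fußballstadion": {"has_active_usage": {"Fußball"}},
    "Fußballschule": {"has_topic": {"Fußball"}},
}


def invert_relations(outgoing):
    """Build target -> relation name -> set of sources from the outgoing map."""
    incoming = defaultdict(lambda: defaultdict(set))
    for source, relations in outgoing.items():
        for relation, targets in relations.items():
            for target in targets:
                incoming[target][relation].add(source)
    return incoming


incoming = invert_relations(OUTGOING)
print(sorted(incoming["Fußball"]["has_active_usage"]))  # ['Fußballschuh', 'Fußballstadion']
print(sorted(incoming["Fuß"]["has_manner_of_functioning"]))  # ['Fußball']
```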

Some lexical units have sense definitions, harvested from the German Wiktionary. These can be accessed with the `wiktionary_paraphrases` field.

In [40]:
fussball_unit.wiktionary_paraphrases

[Wiktionary(LexUnit ID=l29777, definition=Sport, Freizeit, kein Plural: eine beliebte Mannschaftssportart, welche mit 22 Spielern und einem Ball gespielt wird)]

Some lexical units have also been linked to the English WordNet. These can be accessed with the `ili_records` field. 

In [41]:
fussball_unit.ili_records

[IliRecord(LexUnit ID=l29777, relation=synonym, english_equivalent=association football)]

Lexical units which are verbs provide information on language use by giving at least one example sentence.
They are also annotated with subcategorisation patterns / verb complementations (frames).


In [42]:
schiessen = germanet.get_lexunit_by_id("l80272")
schiessen

Lexunit(id=l80272, orthform=schießen, synset_id=s56962)

In [43]:
schiessen.examples

['Er hatte den Ball ins Tor geschossen.']

In [44]:
schiessen.frames

['NN.AN.BD']
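
The frame string shown above looks like a dot-separated list of complement codes. Assuming that reading, and assuming that `AN` is the code for an accusative complement (both are assumptions that this tutorial does not confirm), a frame list can be checked by hand as sketched below. The helper names are hypothetical.

```python
def complements(frame):
    """Split a frame string such as 'NN.AN.BD' into its complement codes."""
    return frame.split(".")


def has_code(frames, code):
    """True if any frame in the list contains the given complement code."""
    return any(code in complements(frame) for frame in frames)


frames = ["NN.AN.BD"]  # the value of schiessen.frames shown above
print(complements(frames[0]))  # ['NN', 'AN', 'BD']
print(has_code(frames, "AN"))  # True (assumption: 'AN' marks an accusative complement)
```

Splitting first and then testing membership avoids false positives that a plain substring test on the whole frame string could produce.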

It is possible to extract verbs with specific complements of interest. For example, if you're interested in all verbs that allow accusative complements, you can extract them with specific methods, defined in the `Frames` class. 

In [45]:
from germanetpy.frames import Frames

f = Frames(germanet.frames2lexunits)
all_verbs_with_accusative_complement = f.extract_accusative_complement()

In [46]:
# How many verbs take accusative complements?
len(all_verbs_with_accusative_complement)

11735

In [47]:
# What are some examples?
(all_verbs_with_accusative_complement.pop(), all_verbs_with_accusative_complement.pop())

(Lexunit(id=l131160, orthform=herausbringen, synset_id=s97523),
 Lexunit(id=l112387, orthform=zerdehnen, synset_id=s83331))

## How to extract a large number of examples by applying a filter function
If you would like to extract several lexical units or synsets from GermaNet that fulfill certain conditions, you can create a filter configuration. For example, filter configurations allow you to search for words of specific
word classes (e.g. you might be interested in extracting all abstract nouns) or to extract all words that 
contain a specific subword. To perform a search, you have to create a filter configuration object and pass a search string as an argument. All other options have defaults, but you can override these defaults to refine your search.


For example, we can search for *schießen* 'shoot' but ignore upper or lowercasing in different orthforms:

In [48]:
from germanetpy.filterconfig import Filterconfig

filterconfig = Filterconfig("schießen", ignore_case=True)
filterconfig.filter_synsets(germanet)

{Synset(id=s123485, lexunits=schießen),
 Synset(id=s21555, lexunits=Schießen, Schießsport, Sportschießen),
 Synset(id=s56650, lexunits=schießen),
 Synset(id=s56664, lexunits=schießen),
 Synset(id=s56962, lexunits=schießen),
 Synset(id=s57998, lexunits=stürmen, stürzen, schießen),
 Synset(id=s59153, lexunits=knipsen, schießen),
 Synset(id=s60205, lexunits=erlegen, schießen)}

Let's now limit the results to synsets of a specific semantic class:

In [49]:
from germanetpy.synset import WordClass

filterconfig.word_classes = [WordClass.Konkurrenz]
filtered_result = filterconfig.filter_synsets(germanet)
[(synset, synset.word_class) for synset in filtered_result]

[(Synset(id=s56664, lexunits=schießen), <WordClass.Konkurrenz: 33>),
 (Synset(id=s56650, lexunits=schießen), <WordClass.Konkurrenz: 33>)]

If we now filter by word category and use only nouns, our result will be empty because there is no entry for 'schießen' as a noun:

In [50]:
filterconfig.word_categories = [WordCategory.nomen]
result = filterconfig.filter_synsets(germanet)
result

set()

Besides using full words as search strings we can use regular expressions. This can be very useful if you are interested 
in words with certain character sequences. The next example shows how to extract all words that end with "kuchen":

In [51]:
filterconfig = Filterconfig('.*kuchen', regex=True)
result = filterconfig.filter_lexunits(germanet)
print("Found  %d words that end with 'kuchen' in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(result), result.pop(), result.pop()))

Found  54 words that end with 'kuchen' in GermaNet 
 An example of such is: Lexunit(id=l57615, orthform=Lebkuchen, synset_id=s39220) 
 Another example is : Lexunit(id=l57639, orthform=Sandkuchen, synset_id=s39239)


This example extracts all nouns that contain whitespace or a hyphen (useful for example to extract multiword expressions):

In [52]:
# extract all nouns that contain whitespace or a hyphen
# (raw string, so that \s reaches the regex engine without an invalid-escape warning)
filterconfig = Filterconfig(r'.+(\s|-).+', regex=True)
filterconfig.word_categories = [WordCategory.nomen]
result = filterconfig.filter_lexunits(germanet)
print("\nFound  %d multiword expressions with whitespace or hyphen in GermaNet \n An example of such is: %s \n Another example is: %s"
      % (len(result), result.pop(), result.pop()))


Found  5419 multiword expressions with whitespace or hyphen in GermaNet 
 An example of such is: Lexunit(id=l100608, orthform=Alb-Donau-Kreis, synset_id=s73656) 
 Another example is: Lexunit(id=l42662, orthform=Tai Lue, synset_id=s31219)


And this example extracts verbs that contain 'ff' or 'ss', with at least one character before and after the doubled letter:

In [53]:
# extract all verbs that contain two or more consecutive 's' or 'f' inside the word
filterconfig = Filterconfig('.+(f{2,}|s{2,}).+', regex=True)
filterconfig.word_categories = [WordCategory.verben]
result = filterconfig.filter_lexunits(germanet)
print("\nFound  %d verbs with double s or double f in GermaNet \n An example of such is: %s \n Another example is : %s"
      % (len(result), result.pop(), result.pop()))


Found  974 verbs with double s or double f in GermaNet 
 An example of such is: Lexunit(id=l79706, orthform=entwaffnen, synset_id=s56556) 
 Another example is : Lexunit(id=l85783, orthform=prassen, synset_id=s61062)
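
To try out a pattern before running it against GermaNet, it can be tested on a handful of words with Python's `re` module. The sketch below assumes that a pattern has to match the whole orthform (hence `re.fullmatch`), which is consistent with the "ends with" behaviour described above but is an assumption about `Filterconfig`, not a documented fact. The word list and the helper name `matching` are made up for illustration.

```python
import re

# Small hand-picked word list (illustrative, not a GermaNet query result).
WORDS = ["Lebkuchen", "Sandkuchen", "Kuchenform", "Tai Lue", "Alb-Donau-Kreis",
         "prassen", "entwaffnen", "essen", "Schiff", "laufen"]


def matching(pattern, words=WORDS):
    """Return the words that the pattern matches from start to end."""
    regex = re.compile(pattern)
    return [word for word in words if regex.fullmatch(word)]


print(matching(r".*kuchen"))             # ['Lebkuchen', 'Sandkuchen']
print(matching(r".+(\s|-).+"))           # ['Tai Lue', 'Alb-Donau-Kreis']
print(matching(r".+(f{2,}|s{2,}).+"))    # ['prassen', 'entwaffnen', 'essen']
```

Note that *Schiff* is not matched by the last pattern: the trailing `.+` requires at least one character after the doubled letter.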
