testperanto tutorial 5: our first fake sentences
------------------------------------------------------

In this tutorial, we will show how to use `testperanto` to generate simple declarative sentences inspired by German grammar. As a first step, we will build a word generator whose role is to create novel German-esque words. We start by providing a list of syllables to a `ListBasedWordGenerator`, which will generate a syllable uniformly at random when we call its `.generate` method.

In [1]:
from testperanto.wordgenerators import ListBasedWordGenerator, AtomBasedWordGenerator
from testperanto.distributions import CategoricalDistribution

german_syllables = ['flach', 'stau', 'bei', 'der', 'dich', 'dung', 'mein',
                    'fin', 'frisch', 'frau', 'geh', 'glied', 'gun', 'gnug', 'haf', 'han', 'heim',
                    'her', 'herr', 'hub', 'lag', 'hung', 'jahr', 'keit', 'kol', 'kom', 'kenn',
                    'kon', 'lang', 'lich', 'ler', 'lung', 'man', 'mensch', 'milch', 'mon', 'nach',
                    'nied', 'par', 'rech', 'rich', 'run', 'rung', 'schlag', 'sam',
                    'schmid', 'sich', 'ster', 'sung', 'tag', 'tel', 'ter', 'tik', 'trum', 'tun',
                    'tung', 'run', 'ver', 'vor', 'wir', 'wohn', 'zer', 'ziem', 'zum']

syllable_generator = ListBasedWordGenerator(german_syllables)
for _ in range(5):
    print(syllable_generator.generate())

lag
run
nach
lang
lag


Next, we construct an `AtomBasedWordGenerator`, whose constructor takes two arguments:
1. A WordGenerator that produces atomic building blocks.
2. A Distribution over word lengths (i.e., how many atoms to concatenate into a word).
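
To make the roles of the two arguments concrete, here is a minimal pure-Python sketch of the idea. It is not testperanto's implementation: the helpers `sample_length` and `make_word` are hypothetical, and we assume that the i-th weight is the relative probability of building a word from i atoms.

```python
import random

def sample_length(weights, rng):
    """Sample an index i with probability proportional to weights[i]."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def make_word(atoms, weights, rng):
    """Concatenate a randomly chosen number of uniformly sampled atoms."""
    length = sample_length(weights, rng)
    return "".join(rng.choice(atoms) for _ in range(length))

rng = random.Random(0)  # seeded so that repeated runs give the same words
atoms = ["lag", "run", "nach", "tag", "keit"]
# Weights for lengths 0..4: lengths 0 and 1 have zero weight and never occur.
words = [make_word(atoms, [0, 0, 0.4, 0.4, 0.1], rng) for _ in range(5)]
print(words)
```

Under this assumption, words of two or three atoms are equally likely and words of four atoms are rarer.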

In [2]:
stem_generator = AtomBasedWordGenerator(syllable_generator,
                                        CategoricalDistribution([0, 0, 0.4, 0.4, 0.1]))

for _ in range(5):
    print(stem_generator.generate())

nachtag
stertunrung
kolruntun
nachkeit
schlaglung


Seems to be generating some German-isch words! Let's register the WordGenerator with testperanto, so we can more easily use it to generate German word stems.

In [3]:
from testperanto.wordgenerators import register_word_generator

register_word_generator("german-stems", stem_generator) 

Once we've registered it, we can easily access it using the `lookup_word_generator` function.

In [4]:
from testperanto.wordgenerators import lookup_word_generator

wordgen = lookup_word_generator("german-stems")
for _ in range(5):
    print(wordgen.generate())

runrechterdung
hublagpar
zerschlaggun
flachjahr
mansam


Now that we can generate words, let's create a *voicebox*. The role of the voicebox is to render generic word descriptions into actual words. By "generic word description," we mean something like the following tree.

In [5]:
from testperanto.trees import TreeNode
tree = TreeNode.from_str("(@nn (STEM noun.27) (COUNT plu))")
print(tree)

(@nn (STEM noun.27) (COUNT plu))


This is an abstract way to say "the 27th noun of our vocabulary, expressed as a plural." To render this, we can use a `MorphologyVoicebox`.

In [6]:
from testperanto.voicebox import MorphologyVoicebox

vbox = MorphologyVoicebox(lookup_word_generator('german-stems'))
print(vbox.run(tree))

dergnughaf


The `MorphologyVoicebox` will remember that `noun.27` is associated with the chosen word, so if we run it again, we get the same thing, but if we try a different noun (like `noun.39`), it will come up with a new word.

In [7]:
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.27) (COUNT plu))")))
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.39) (COUNT plu))")))

dergnughaf
ziemfin
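
The remembering behavior can be pictured as a simple cache from stem identifiers to words. The following is only a sketch of that idea under our own assumptions, not the library's code; `MemoizingNamer` and its methods are hypothetical names.

```python
import itertools

class MemoizingNamer:
    """Maps each stem id (e.g. 'noun.27') to a word, generating it only once."""

    def __init__(self, generate):
        self.generate = generate  # zero-argument callable returning a new word
        self.cache = {}

    def word_for(self, stem_id):
        # Generate a word the first time we see this stem id, then reuse it.
        if stem_id not in self.cache:
            self.cache[stem_id] = self.generate()
        return self.cache[stem_id]

counter = itertools.count(1)
namer = MemoizingNamer(lambda: f"word{next(counter)}")
print(namer.word_for("noun.27"))  # word1
print(namer.word_for("noun.39"))  # word2
print(namer.word_for("noun.27"))  # word1
```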


So far we haven't done anything to express the fact that the nouns are plural. In English, this is typically done by adding the letter "s". A `MorphologyVoicebox` allows us to specify morphers that sequentially modify a stem to express syntactic properties like plurality.

In [8]:
from testperanto.morphology import EnglishNounMorpher

vbox = MorphologyVoicebox(lookup_word_generator('german-stems'), [EnglishNounMorpher()])
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.21) (COUNT sng))")))
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.21) (COUNT plu))")))

telkennsich
telkennsichs


But German, although it sometimes uses "s" as a plural, more often uses other endings, like "en". We can create a custom morpher as follows:

In [9]:
from testperanto.morphology import SuffixMorpher

noun_morpher = SuffixMorpher(property_names=['COUNT'],
                             suffix_map={('sng',): '', ('plu',): 'en'})
vbox = MorphologyVoicebox(lookup_word_generator('german-stems'), [noun_morpher])
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.47) (COUNT sng))")))
print(vbox.run(TreeNode.from_str("(@nn (STEM noun.47) (COUNT plu))")))

tagmein
tagmeinen
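
Conceptually, a suffix morpher of this kind looks up the values of the named properties, in order, and appends the matching suffix. A minimal sketch of that lookup, using a hypothetical helper `apply_suffix` rather than the real `SuffixMorpher` class, might look like this:

```python
def apply_suffix(stem, properties, property_names, suffix_map):
    """Append the suffix selected by the values of the named properties."""
    # Build the lookup key in the order given by property_names.
    key = tuple(properties[name] for name in property_names)
    return stem + suffix_map[key]

suffix_map = {("sng",): "", ("plu",): "en"}
print(apply_suffix("tagmein", {"COUNT": "sng"}, ["COUNT"], suffix_map))  # tagmein
print(apply_suffix("tagmein", {"COUNT": "plu"}, ["COUNT"], suffix_map))  # tagmeinen
```

Because the key is a tuple, the same lookup extends to maps keyed on several properties at once.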


German verbs acquire suffixes based on the person (1st, 2nd, 3rd) and the count (sng, plu). We can create another morpher and voicebox for verbs.

In [10]:
verb_morpher = SuffixMorpher(property_names=['PERSON', 'COUNT'],
                             suffix_map={('1', 'sng'): 'e', ('1', 'plu'): 'en',
                                         ('2', 'sng'): 'st', ('2', 'plu'): 't',
                                         ('3', 'sng'): 't', ('3', 'plu'): 'en'})
vbox = MorphologyVoicebox(lookup_word_generator('german-stems'), [verb_morpher])
print("ich " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 1) (COUNT sng))"))))
print("du  " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 2) (COUNT sng))"))))
print("es  " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 3) (COUNT sng))"))))
print("wir " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 1) (COUNT plu))"))))
print("ihr " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 2) (COUNT plu))"))))
print("sie " + str(vbox.run(TreeNode.from_str("(@vb (STEM verb.47) (PERSON 3) (COUNT plu))"))))

ich hanheimhere
du  hanheimherst
es  hanheimhert
wir hanheimheren
ihr hanheimhert
sie hanheimheren


We can capture these two voicebox functionalities by creating a `ManagingVoicebox`, which delegates the rendering of generic word representations to helper voiceboxes.

In [11]:
from testperanto.voicebox import ManagingVoicebox

vbox = ManagingVoicebox()
vbox.delegate('vb', MorphologyVoicebox(lookup_word_generator('german-stems'), 
                                       morphers=[verb_morpher]))
vbox.delegate('nn', MorphologyVoicebox(lookup_word_generator('german-stems'), 
                                       morphers=[noun_morpher]))

tree = TreeNode.from_str("(S (@nn (STEM noun.76) (COUNT sng)) (@vb (STEM verb.22) (PERSON 3) (COUNT sng)))")
print(vbox.run(tree))
tree = TreeNode.from_str("(S (@nn (STEM noun.76) (COUNT plu)) (@vb (STEM verb.22) (PERSON 3) (COUNT plu)))")
print(vbox.run(tree))

(S vorrichschlag schmidmont)
(S vorrichschlagen schmidmonen)
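
The delegation pattern itself is small. As a rough illustration (the `Dispatcher` class below is hypothetical and much simpler than a real voicebox), each tag is mapped to a handler that knows how to render word descriptions carrying that tag. For brevity, `STEM` here holds the already-chosen word rather than a vocabulary index.

```python
class Dispatcher:
    """Routes a (tag, properties) word description to the handler for its tag."""

    def __init__(self):
        self.handlers = {}

    def delegate(self, tag, handler):
        self.handlers[tag] = handler

    def render(self, tag, properties):
        return self.handlers[tag](properties)

plural = {"sng": "", "plu": "en"}          # noun endings by count
third_person = {"sng": "t", "plu": "en"}   # 3rd-person verb endings by count

dispatcher = Dispatcher()
dispatcher.delegate("nn", lambda p: p["STEM"] + plural[p["COUNT"]])
dispatcher.delegate("vb", lambda p: p["STEM"] + third_person[p["COUNT"]])

print(dispatcher.render("nn", {"STEM": "vorrichschlag", "COUNT": "plu"}))  # vorrichschlagen
print(dispatcher.render("vb", {"STEM": "schmidmon", "COUNT": "sng"}))      # schmidmont
```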


Finally, we'll add a voicebox to generate German determiners, and wrap it all up as a *voicebox theme*. German determiners depend both on the *gender* of the noun (German nouns have three genders: masculine, feminine, and neuter) and on the *case* (nominative, accusative, dative, or genitive). We'll take care of gender later. For the time being, we'll assume that the nouns are masculine.

In [12]:
from testperanto.globals import EMPTY_STR
from testperanto.morphology import SuffixMorpher
from testperanto.voicebox import VoiceboxTheme, register_voicebox_theme
from testperanto.voicebox import ManagingVoicebox, MorphologyVoicebox
from testperanto.wordgenerators import lookup_word_generator
from testperanto.trees import TreeNode

class GermanTheme(VoiceboxTheme):

    def init_vbox(self):
        vbox = ManagingVoicebox()
        verb_morpher = SuffixMorpher(property_names=['PERSON', 'COUNT'],
                                     suffix_map={('1', 'sng'): 'e', ('1', 'plu'): 'en',
                                                 ('2', 'sng'): 'st', ('2', 'plu'): 't',
                                                 ('3', 'sng'): 't', ('3', 'plu'): 'en'})
        noun_morpher = SuffixMorpher(property_names=['COUNT'],
                                     suffix_map={('sng',): '', ('plu',): 'en'})
        dt_morpher = SuffixMorpher(property_names=('COUNT', 'CASE'),
                                   suffix_map={('sng', 'nom'): 'der',
                                               ('plu', 'nom'): 'die',
                                               ('sng', 'acc'): 'den',
                                               ('plu', 'acc'): 'die'})
        vbox.delegate('vb', MorphologyVoicebox(lookup_word_generator('german-stems'), 
                                               morphers=[verb_morpher]))
        vbox.delegate('nn', MorphologyVoicebox(lookup_word_generator('german-stems'), 
                                               morphers=[noun_morpher]))
        vbox.delegate('dt', MorphologyVoicebox(None, 
                                               morphers=[dt_morpher]))
        return vbox
    
vbox = GermanTheme().init_vbox()
tree = TreeNode.from_str( "(S"  
                        + "  (NP" 
                        + "     (DT (@dt (COUNT sng) (CASE nom)))" 
                        + "     (NN (@nn (STEM noun.1) (COUNT sng))))"
                        + "  (VB (@vb (STEM verb.1) (COUNT sng) (PERSON 3))))")
print(vbox.run(tree))
tree = TreeNode.from_str( "(S"  
                        + "  (NP" 
                        + "     (DT (@dt (COUNT plu) (CASE nom)))" 
                        + "     (NN (@nn (STEM noun.1) (COUNT plu))))"
                        + "  (VB (@vb (STEM verb.1) (COUNT plu) (PERSON 3))))")
print(vbox.run(tree))

(S (NP (DT der) (NN frischtelsung)) (VB derkomt))
(S (NP (DT die) (NN frischtelsungen)) (VB derkomen))


Now let's register this voicebox theme with testperanto, and use it to generate some simple noun phrases. We'll use y- and z-variables to enforce agreement between the determiner and the noun.
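
Before looking at the grammar, it may help to see why shared variables enforce agreement. The sketch below is an illustration only and uses Python's `string.Template` rather than anything from testperanto; the helper `expand` is hypothetical. The point is that one binding of `$y2` and `$y3` is copied into both children, so the determiner and the noun cannot disagree on case or count.

```python
from string import Template

def expand(rhs, bindings):
    """Fill the $-variables of a rule's right-hand side with concrete values."""
    return Template(rhs).substitute(bindings)

# Bindings that a state such as NP.7.nom.plu would pass down to its children.
bindings = {"y1": "7", "y2": "nom", "y3": "plu"}
print(expand("DT.$y2.$y3 NN.$y1.$y2.$y3", bindings))  # DT.nom.plu NN.7.nom.plu
```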

In [13]:
from testperanto.config import init_grammar_macro, generate_sentences
from testperanto.voicebox import register_voicebox_theme
register_voicebox_theme("deutsch", GermanTheme)

config = {
    "distributions": [
        {"name": "nn", "type": "pyor", "strength": 1, "discount": 0.4},
        {"name": "count", "type": "uniform", "domain": ["sng", "plu"]}
    ],
    "grammar": [
        {"rule": "START -> NP.$z1.nom.$z2", "zdists": ["nn", "count"]},
        {"rule": "NP.$y1.$y2.$y3 -> DT.$y2.$y3 NN.$y1.$y2.$y3"},
        {"rule": "DT.$y1.$y2 -> (@dt (CASE $y1) (COUNT $y2))"},
        {"rule": "NN.$y1.$y2.$y3 -> (@nn (STEM noun.$y1) (CASE $y2) (COUNT $y3))"}
    ]
}

grammar = init_grammar_macro(config)
for sent in generate_sentences(grammar, start_state='START', vbox_theme="deutsch", num_to_generate=10):
    print(sent)

die manmantiken
der hubterstauder
die tungwohnhungen
der hubterstauder
die manmantiken
die tungwohnhungen
die wirmonschlaglungen
die manmantiken
der hubterstauder
die sammonteren





Note that every plural noun appears with the plural determiner "die", and every singular noun appears with the singular determiner "der". We can use the same principles to build simple sentences in which the main verb is conjugated to agree with the subject, and in which the determiners take the nominative form ("der" for singular nouns) when they determine the subject and the accusative form ("den" for singular nouns) when they determine the object.

In [14]:
config = {
    "distributions": [
        {"name": "vb", "type": "pyor", "strength": 500, "discount": 0.4},
        {"name": "nn", "type": "pyor", "strength": 500, "discount": 0.4},
        {"name": "count", "type": "uniform", "domain": ["sng", "plu"]}
    ],
    "grammar": [
        {"rule": "START -> NP.$z1.nom.$z2 VP.$z1.$z2", "zdists": ["vb", "count"]},

        {"rule": "VP.$y1.$y2 -> VB.$y1.$y2 NP.$z1.acc.$z2", "zdists": ["nn", "count"]},
        {"rule": "NP.$y1.$y2.$y3 -> DT.$y2.$y3 NN.$z1.$y2.$y3", "zdists": ["nn"]},
        {"rule": "VB.$y1.$y2 -> (@vb (STEM verb.$y1) (COUNT $y2) (PERSON 3) (TENSE present))"},
        {"rule": "DT.$y1.$y2 -> (@dt (CASE $y1) (COUNT $y2))"},
        {"rule": "NN.$y1.$y2.$y3 -> (@nn (STEM noun.$y1) (CASE $y2) (COUNT $y3))"}
    ]
}
grammar = init_grammar_macro(config)
for sent in generate_sentences(grammar, start_state='START', vbox_theme="deutsch", num_to_generate=10):
    print(sent)

die gunsichen tungschmiden die beikeitbeimilchen
der rungternach wirzerdungt den dichrichjahr
die fraunachen gnughafstauen den vorver
die sichkeiten tagsungen die keitgunkonen
die kolfrischen kenntertunen den gunschlag
die trummeinglieden tuntikteren die gehmilchwiren
die dersungrunlagen keitdungmenschen den menschdichnach
die kolgliedlungen hubnieden die tungsungen
der lagzumtrum lerlagt die dichzersungen
der terlichwir wohnschmidt den gliednach





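Now let's add gender to our nouns. First we augment our voicebox theme to generate the correct determiner for each combination of count, case, and gender. (This version of the theme also simplifies the verb morpher so that it depends only on `COUNT`.)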

In [15]:
class GermanTheme(VoiceboxTheme):
    """A voicebox theme that generates German-esque words."""

    def init_vbox(self):
        vbox = ManagingVoicebox()
        verb_morpher = SuffixMorpher(property_names=('COUNT',),
                                     suffix_map={('sng',): 'e', ('plu',): 'en'})
        noun_morpher = SuffixMorpher(property_names=('COUNT',),
                                     suffix_map={('sng',): '', ('plu',): 'en'})
        vbox.delegate('vb', MorphologyVoicebox(lookup_word_generator('german-stems'), [verb_morpher]))
        vbox.delegate('nn', MorphologyVoicebox(lookup_word_generator('german-stems'), [noun_morpher]))
        dt_morph = SuffixMorpher(property_names=('COUNT', 'CASE', 'GENDER'),
                                 suffix_map={('sng', 'nom', 'm'): 'der',
                                             ('plu', 'nom', 'm'): 'die',
                                             ('sng', 'acc', 'm'): 'den',
                                             ('plu', 'acc', 'm'): 'die',
                                             ('sng', 'nom', 'f'): 'die',
                                             ('plu', 'nom', 'f'): 'die',
                                             ('sng', 'acc', 'f'): 'die',
                                             ('plu', 'acc', 'f'): 'die',
                                             ('sng', 'nom', 'n'): 'das',
                                             ('plu', 'nom', 'n'): 'die',
                                             ('sng', 'acc', 'n'): 'das',
                                             ('plu', 'acc', 'n'): 'die'
                                             })
        vbox.delegate('dt', MorphologyVoicebox(None, [dt_morph]))
        return vbox

register_voicebox_theme("deutsch", GermanTheme)

Then we create distributions of the form `gender.$y1`, where `$y1` is the vocabulary index of a particular noun. To make sure that a given noun is always assigned the same gender, we use "sticky" distributions. The first time a sticky distribution is sampled, it draws from a CategoricalDistribution over `{'m', 'f', 'n'}`; every subsequent time, it returns that first sampled value.

In [16]:
config = {
    "distributions": [
        {"name": "vb", "type": "pyor", "strength": 500, "discount": 0.4},
        {"name": "nn", "type": "pyor", "strength": 1, "discount": 0.4},
        {"name": "gender.$y1", "type": "sticky", "domain": ["m", "f", "n"], "weights": [0.3, 0.3, 0.4]},
        {"name": "count", "type": "uniform", "domain": ["sng", "plu"]}
    ],
    "grammar": [
        {"rule": "START -> NP.$z1.nom.$z2 VP.$z3.$z2", "zdists": ["nn", "count", "vb"]},
        {"rule": "VP.$y1.$y2 -> VB.$y1.$y2 NP.$z1.acc.$z2", "zdists": ["nn", "count"]},
        {"rule": "NP.$y1.$y2.$y3 -> DT.$y2.$y3.$z1 NN.$y1.$y2.$y3.$z1", "zdists": ["gender.$y1"]},
        {"rule": "VB.$y1.$y2 -> (@vb (STEM verb.$y1) (COUNT $y2) (PERSON 3) (TENSE present))"},
        {"rule": "DT.$y1.$y2.$y3 -> (@dt (CASE $y1) (COUNT $y2) (GENDER $y3))"},
        {"rule": "NN.$y1.$y2.$y3.$y4 -> (@nn (STEM noun.$y1) (CASE $y2) (COUNT $y3) (GENDER $y4))"}
    ]
}
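
The sticky behavior can be sketched in a few lines of plain Python. This is an illustration under our own assumptions, not the code behind the `"sticky"` distribution type; `StickySampler` and `gender_of` are hypothetical names.

```python
import random

class StickySampler:
    """Samples once from a weighted domain, then always returns that value."""

    def __init__(self, domain, weights, rng):
        self.domain = domain
        self.weights = weights
        self.rng = rng
        self.value = None

    def sample(self):
        # Only the first call draws a value; later calls reuse it.
        if self.value is None:
            self.value = self.rng.choices(self.domain, weights=self.weights, k=1)[0]
        return self.value

rng = random.Random(0)
genders = {}  # one sticky sampler per noun index, created on first use

def gender_of(noun_index):
    if noun_index not in genders:
        genders[noun_index] = StickySampler(["m", "f", "n"], [0.3, 0.3, 0.4], rng)
    return genders[noun_index].sample()

first = [gender_of(i) for i in range(5)]
second = [gender_of(i) for i in range(5)]
print(first == second)  # True
```

Keeping one sampler per noun index mirrors the `gender.$y1` naming: each noun gets its own distribution, so different nouns may receive different genders while each individual noun stays fixed.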


In [17]:
grammar = init_grammar_macro(config)
for sent in generate_sentences(grammar, start_state='START', vbox_theme="deutsch", num_to_generate=10):
    print(sent)

die langhanen gunsamen die herrmonen
die dunggnughafmenschen manmanvoren das gliedkeitmon
die lagfin flachsterdiche die hanherrvorrungen
die tagtrumen termeinmenschen das telstergun
das glieddernach koltike das langhan
die tagtrumen lerkeitlagen das trumman
die tagtrumen lagkonen den dunggnughafmensch
der konmein jahrkome das trumman
das trumman runsunglagschlage die konmeinen
die langhanen sunggliedfrauen das langhan





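Observe that each noun is always assigned a consistent gender. For example, `trumman` is neuter ("das trumman") whether it appears as the subject or the object, and `konmein` takes the masculine "der" in the nominative singular.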