testperanto tutorial 2: in which we get lazy
--------------------------------------------------

One issue with the grammars from Tutorial 1 is that we had to explicitly specify generic words like ```noun.34``` and ```verb.281```. It would be simpler to have the computer generate these generic words from a template.

We do so by replacing rules with a more general form called a **rule macro**. 

In [7]:
from testperanto.config import init_grammar_macro, generate_sentences

config = {"grammar": [
            {"rule": "START -> NN"},
            {"rule": "NN -> (@nn (STEM noun.$z1) (COUNT sng))", "zdists": ["nn"]}
          ],
          "distributions": [
            {"name": "nn", "type": "uniform", "domain": [1, 2, 3]}
          ]}
grammar = init_grammar_macro(config)

The rule macro:

     NN -> (@nn (STEM noun.$z1) (COUNT sng)) 
     
introduces the variable `$z1` and implicitly represents all of the rules we could obtain by substituting a value for `$z1`. But this raises two questions:

1. What can we substitute in?
2. What are the weights of the expanded rules?

To answer these questions, we associate `$z1` with a distribution called `nn`. The rule's `zdists` key holds a list of distribution names, one per z-variable, in order: the first entry governs `$z1`, the second governs `$z2`, and so on. Here, the `distributions` section defines `nn` as a uniform distribution over the integers from 1 to 3. Thus the macro `NN -> (@nn (STEM noun.$z1) (COUNT sng))` abbreviates the 3 rules:

    NN -> (@nn (STEM noun.1) (COUNT sng)) 
    NN -> (@nn (STEM noun.2) (COUNT sng)) 
    NN -> (@nn (STEM noun.3) (COUNT sng)) 

The weight of each rule is the baseweight of the macro (recall that this defaults to `1.0`) multiplied by the probability of its instantiation of `$z1` according to distribution `nn`. Thus, each rule has the weight `1/3`. Therefore, if we generate sentences from this grammar, we should obtain three different words in roughly equal proportion.
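
The expansion and weighting described above can be sketched in plain Python. This is only an illustration of the arithmetic, not testperanto's implementation: `expand_macro` is a hypothetical helper, and it assumes a single z-variable with a uniform distribution.

```python
from fractions import Fraction


def expand_macro(macro, domain, base_weight=1):
    """Hypothetical helper (not part of testperanto).

    Substitutes every value of a uniform domain for $z1 and pairs each
    resulting rule with base_weight * P(value).
    """
    prob = Fraction(1, len(domain))  # uniform: each value is equally likely
    return [(macro.replace("$z1", str(value)), base_weight * prob)
            for value in domain]


rules = expand_macro("NN -> (@nn (STEM noun.$z1) (COUNT sng))", [1, 2, 3])
for rule, weight in rules:
    print(weight, rule)
```

Using `Fraction` keeps the weights exact, so it is easy to check that the three weights sum to the base weight of the macro.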

In [8]:
for sent in generate_sentences(grammar, start_state='START', num_to_generate=10):
    print(sent)

mofruglolal
dulahot
mofruglolal
dulahot
mofruglolal
mofruglolal
mofruglolal
leefrijod
dulahot
dulahot





Rule macros can contain two different types of variables:
- **y-variables**, which can appear on both the left and right sides of a macro and are **never** associated with a distribution
- **z-variables**, which can appear only on the right side of a macro and **must** be associated with a distribution

The role of y-variables can be seen in the following example:

In [9]:
from testperanto.config import init_grammar_macro, generate_sentences

config = {"grammar": [
            {"rule": "START -> ADJ.$z1 NN.$z1", "zdists": ["unif"]},
            {"rule": "ADJ.$y1 -> (@adj (STEM adj.$y1))"},
            {"rule": "NN.$y1 -> (@nn (STEM noun.$y1) (COUNT plu))"}
          ],
          "distributions": [
            {"name": "unif", "type": "uniform", "domain": [1, 2]}
          ]}
grammar = init_grammar_macro(config)
for sent in generate_sentences(grammar, start_state='START', num_to_generate=10):
    print(sent)

chifraludish foglurs
chifraludish foglurs
chifraludish foglurs
chofilugish flacochamafs
chifraludish foglurs
chofilugish flacochamafs
chofilugish flacochamafs
chifraludish foglurs
chofilugish flacochamafs
chofilugish flacochamafs





Observe that the same adjective always appears with the same noun. Both occurrences of `$z1` in the first macro receive the same value, and the y-variables then carry that value down to the adjective and noun rules. The first macro

    START -> ADJ.$z1 NN.$z1
    
encodes the two rules:

    START -> ADJ.1 NN.1
    START -> ADJ.2 NN.2
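
To see why sharing `$z1` yields two rules rather than four, here is a small sketch that contrasts the macro above with a variant that uses two independent z-variables. The `expand` helper and the `START -> ADJ.$z1 NN.$z2` variant are both hypothetical, introduced only for this comparison; the sketch ignores weights and is not how testperanto expands macros internally.

```python
from itertools import product


def expand(macro, domains):
    """Hypothetical helper: domains[i] lists the values for $z{i+1}.

    Every occurrence of the same z-variable gets the same value.
    """
    rules = []
    for values in product(*domains):
        rule = macro
        for index, value in enumerate(values, start=1):
            rule = rule.replace(f"$z{index}", str(value))
        rules.append(rule)
    return rules


# One shared z-variable: adjective and noun indices always agree.
shared = expand("START -> ADJ.$z1 NN.$z1", [[1, 2]])

# Two independent z-variables (hypothetical variant): all four pairings.
independent = expand("START -> ADJ.$z1 NN.$z2", [[1, 2], [1, 2]])

print(shared)
print(independent)
```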

The y-variables in the second and third macros then propagate this choice. For instance, the second macro

    ADJ.$y1 -> (@adj (STEM adj.$y1))
    
encodes the rules:

    ADJ.1 -> (@adj (STEM adj.1))
    ADJ.2 -> (@adj (STEM adj.2))
    
whereas the third macro

    NN.$y1 -> (@nn (STEM noun.$y1) (COUNT plu))

encodes the rules:

    NN.1 -> (@nn (STEM noun.1) (COUNT plu))
    NN.2 -> (@nn (STEM noun.2) (COUNT plu))
    
**Important:** there is no need to specify a domain or distribution for y-variables. They will automatically match any nonterminal of that form that is constructed during a CFG derivation.
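
The matching behavior of y-variables can be pictured as pattern matching on the left side of a macro. The sketch below is an assumption about how one might model this, not testperanto's actual mechanism: `instantiate` is a hypothetical helper that handles a single `$y1` and assumes the bound value consists of word characters.

```python
import re


def instantiate(macro, nonterminal):
    """Hypothetical helper: bind $y1 by matching the macro's left side
    against a concrete nonterminal, then substitute it on the right side.

    Returns the instantiated rule, or None if the macro does not apply.
    """
    lhs, rhs = macro.split(" -> ", 1)
    # Treat the left side as literal text, with $y1 as a capture group.
    pattern = re.escape(lhs).replace(re.escape("$y1"), r"(\w+)")
    match = re.fullmatch(pattern, nonterminal)
    if match is None:
        return None
    return f"{nonterminal} -> {rhs.replace('$y1', match.group(1))}"


adj_macro = "ADJ.$y1 -> (@adj (STEM adj.$y1))"
print(instantiate(adj_macro, "ADJ.2"))  # ADJ.2 -> (@adj (STEM adj.2))
print(instantiate(adj_macro, "NN.2"))   # None: the left side does not match
```

No domain is listed anywhere in this sketch: the value of `$y1` comes entirely from whichever nonterminal shows up during the derivation.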