In this notebook, we will use ASTrED in full-auto mode. That means that tokenisation, parsing, and word alignment all
 happen automatically. This is easy, but slow and likely less accurate than manual annotation. I would especially
 encourage you to use manual word alignments. But in this example, we show that you _can_ do it all automatically,
 which may be useful for large parallel corpus studies.

In [None]:
!pip install "astred[stanza]"
!pip install git+https://github.com/BramVanroy/awesome-align.git@astred_compat

In [None]:
from astred import AlignedSentences, Sentence

By default, the library assumes that the text you provide is pretokenised, with tokens separated by spaces.
 If that is not the case, we need to set `is_tokenized` to `False`.
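
To see why the pretokenised assumption matters, here is a minimal sketch that only uses naive whitespace splitting. It is an illustration of the assumption, not ASTrED's or `stanza`'s actual tokenisation logic; the variable names are made up for this example.

```python
# Illustration only: naive whitespace splitting, not ASTrED's or stanza's tokeniser.
raw = "Yesterday, I ate some cookies."
pretokenised = "Yesterday , I ate some cookies ."

raw_tokens = raw.split(" ")
pre_tokens = pretokenised.split(" ")

print(raw_tokens)  # punctuation stays glued to the neighbouring words
print(pre_tokens)  # every token, including punctuation, is a separate item
```

With raw text, splitting on spaces leaves "Yesterday," and "cookies." as single tokens, which is why untokenised input needs a real tokeniser first.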

In the cell below, you'll notice that we do not provide any word alignment information to the `AlignedSentences`
 constructor. When no word alignments are provided, an automatic aligner is instantiated that aligns the given
 source and target tokens. For this we rely on a fork of AwesomeAlign (see the README), a multilingual, neural
 aligner.

In [None]:
sent_en = Sentence.from_text("Yesterday, I ate some cookies.", "en", is_tokenized=False)
sent_nl = Sentence.from_text("Ik at gisteren wat koekjes.", "nl", is_tokenized=False)

aligned = AlignedSentences(sent_en, sent_nl)

As you can see below, these alignments are good but not great. All predicted alignments are correct, but the
alignment between "Yesterday" and "gisteren" is missing. The tokenizer did a perfect job, however!

In [None]:
print(aligned.src.text)
print(aligned.tgt.text)
print(aligned.giza_word_aligns)
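
A missing link such as the one between "Yesterday" and "gisteren" is easy to spot programmatically. The sketch below assumes that alignments are written as space-separated, zero-based `src-tgt` index pairs; that format, the helper names, and the alignment string itself are assumptions made for illustration, not actual aligner output.

```python
def parse_alignments(align_str):
    """Turn a string such as "0-0 1-2" into a list of (src_idx, tgt_idx) pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in align_str.split()]


def unaligned(tokens, aligned_idxs):
    """Return the tokens whose index does not occur in any alignment pair."""
    return [tok for idx, tok in enumerate(tokens) if idx not in aligned_idxs]


src_tokens = ["Yesterday", ",", "I", "ate", "some", "cookies", "."]
tgt_tokens = ["Ik", "at", "gisteren", "wat", "koekjes", "."]

# Hypothetical alignment string for illustration; not actual aligner output.
align_str = "2-0 3-1 4-3 5-4 6-5"

pairs = parse_alignments(align_str)
src_aligned = {s for s, _ in pairs}
tgt_aligned = {t for _, t in pairs}

missing_src = unaligned(src_tokens, src_aligned)
missing_tgt = unaligned(tgt_tokens, tgt_aligned)
print(missing_src)
print(missing_tgt)
```

In a large corpus study, a check like this could flag sentence pairs with many unaligned content words for manual review.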

We can also display the dependency trees to see how well the automatic parser did. `stanza`
 (the parser) did a perfect job. The difference between `obl` for "Yesterday" and `advmod` for "gisteren" is open
 for discussion: `obl` is used for nominals (nouns and noun phrases) and `advmod` for adverbs. Even on a theoretical
 level you can debate whether "yesterday" and "gisteren" are nouns or adverbs, but I'll leave that up to the theorists.

Note how the trees display both the text and the dependency relation. You can pass any attribute of a `Word`
 (e.g. `upos`, `id`, `head` and so on) to `attrs` and it will be included in the tree.
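
To make the role of `attrs` more concrete, here is a rough sketch of how a tree could be rendered as a bracketed string with a configurable set of attributes per node. The `Node` class, the `to_bracketed` function and the `|` separator are hypothetical and are not ASTrED's implementation. The tree is hand-built and partial; only the `obl` label comes from the discussion above, the other labels and tags are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a word in a dependency tree."""
    text: str
    deprel: str
    upos: str
    children: List["Node"] = field(default_factory=list)


def to_bracketed(node: Node, attrs: List[str], sep: str = "|") -> str:
    """Render a tree as nested brackets, labelling each node with the chosen attributes."""
    label = sep.join(str(getattr(node, attr)) for attr in attrs)
    kids = " ".join(to_bracketed(child, attrs, sep) for child in node.children)
    return f"({label} {kids})" if kids else f"({label})"


# Partial, hand-built tree; labels and tags are illustrative.
tree = Node("ate", "root", "VERB", [
    Node("Yesterday", "obl", "NOUN"),
    Node("I", "nsubj", "PRON"),
    Node("cookies", "obj", "NOUN", [Node("some", "det", "DET")]),
])

print(to_bracketed(tree, ["text", "deprel"]))
print(to_bracketed(tree, ["text", "upos"]))
```

Changing the list passed as `attrs` changes what each node shows, without touching the tree itself.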

In [None]:
# This cell does not work on remote environments such as Colab
# Un-comment to try it on your local machine
# from nltk.tree import Tree as NltkTree
# from IPython.display import display
#
# display(NltkTree.fromstring(sent_en.tree.to_string(attrs=["text", "deprel"])))
# display(NltkTree.fromstring(sent_nl.tree.to_string(attrs=["text", "deprel"])))

## Separate aligner

By default, the aligner uses the pretrained model `bert-base-multilingual-cased`, which is downloaded
 automatically behind the scenes. However, you may choose to train or fine-tune your own model, or download
 [pre-existing ones](https://github.com/neulab/awesome-align#model-performance), and use that instead. In that
 case, you can instantiate an aligner yourself and pass it to the `AlignedSentences` constructor. The provided
 aligner will then be used instead of the default one.

In [None]:
from astred import Aligner

# kwargs can contain options specific to AwesomeAlign. The most important one is probably GPU usage: by default
# a GPU will be used if it is available. Here we disable it with `no_cuda`.
kwargs = {"no_cuda": True}
# Provide the directory that contains pytorch_model.bin and the other model files.
# The path below is a placeholder, so this cell will not run until you point it to a real model directory.
aligner = Aligner(r"C:\path\to\your\model\dir", **kwargs)

sent_en = Sentence.from_text("Yesterday, I ate some cookies.", "en", is_tokenized=False)
sent_nl = Sentence.from_text("Ik at gisteren wat koekjes.", "nl", is_tokenized=False)

aligned = AlignedSentences(sent_en, sent_nl, aligner=aligner)

If no `aligner` is provided, the class variable `AlignedSentences._aligner` holds a default aligner
 that is shared by all `AlignedSentences` instances. If you do not wish to use this default aligner, pass your own
 as shown above.
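
The shared-default pattern can be sketched in a few lines. The classes below (`ToyAligner`, `ToyAlignedPair`) are hypothetical stand-ins, and creating the default lazily on first use is an assumption of this sketch; the library's actual implementation may differ.

```python
class ToyAligner:
    """Hypothetical aligner that just remembers which model it was built with."""

    def __init__(self, model_name="default-model"):
        self.model_name = model_name


class ToyAlignedPair:
    # Class variable shared by all instances; created lazily on first use.
    _aligner = None

    def __init__(self, src, tgt, aligner=None):
        self.src = src
        self.tgt = tgt
        if aligner is None:
            # Fall back to the shared default, building it only once.
            if ToyAlignedPair._aligner is None:
                ToyAlignedPair._aligner = ToyAligner()
            aligner = ToyAlignedPair._aligner
        self.aligner = aligner


a = ToyAlignedPair("I ate", "Ik at")
b = ToyAlignedPair("some cookies", "wat koekjes")
c = ToyAlignedPair("Yesterday", "gisteren", aligner=ToyAligner("custom-model"))

print(a.aligner is b.aligner)  # both fall back to the same shared default
print(c.aligner.model_name)    # the explicitly passed aligner is used instead
```

Sharing one default object avoids loading a large model once per sentence pair, while the `aligner` argument keeps the door open for custom models.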