# Working with languages other than English

After reading this section, you should:

 - know how to download and use language models in Stanza, a Python library for processing many languages
 - know how to interface Stanza with the spaCy natural language processing library

## Introduction

Part II introduced basic natural language processing tasks using examples written in the English language.

As a global *lingua franca*, English is a highly resourced language in terms of natural language processing. Compared to other languages, English has far more resources in terms of both volume and variety, which makes it easier to find large datasets for training models to perform diverse tasks.

The imbalance in resources and research effort has led to a situation where advances in processing English are occasionally claimed to hold for natural language in general.

However, as Bender ([2019](https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/)) has pointed out, *English is not a synonym for natural language*: even if one demonstrates that computers can achieve or surpass human-level performance in some natural language processing task for the English language, this does not mean that one has solved this task or problem for *natural language as a whole*.

To measure progress in the field of natural language processing and to ensure that as many languages as possible can benefit from language technology, it is highly desirable to conduct research on processing typologically different languages across the world. 

## Stanza – a Python library for processing many languages

[Stanza](https://stanfordnlp.github.io/stanza/) is a Python library for natural language processing that provides pre-trained language models for [many languages](https://stanfordnlp.github.io/stanza/available_models.html) (Qi et al. [2020](https://www.aclweb.org/anthology/2020.acl-demos.14/)).

Stanza language models are trained on corpora annotated using the [Universal Dependencies](https://universaldependencies.org/) formalism, which means that the models can perform tasks such as tokenization, part-of-speech tagging, morphological tagging and dependency parsing. These are essentially the same tasks that we explored using the spaCy library in [Part II](../part_ii/03_basic_nlp.ipynb).

Let's start by importing the Stanza library.

In [None]:
# Import the Stanza library
import stanza

To process a given language, we must first download a Stanza language model using the `download()` function.

The `download()` function requires a single argument, `lang`, which defines the language model to be downloaded.

To download a model for a given language, retrieve its language code, which typically consists of two letters (e.g. *wo*), from [the list of available language models](https://stanfordnlp.github.io/stanza/available_models.html) and pass the code as a string object to the `lang` argument.

For example, use the following code to download a model for Wolof, a language spoken in West Africa that belongs to the family of Niger-Congo languages. The model has been trained using the Wolof treebank (Dione [2019](https://www.aclweb.org/anthology/W19-8003/)).

```python
stanza.download(lang='wo')
```

For some languages, Stanza provides several models that have been trained on different datasets. By default, Stanza automatically downloads the model trained on the largest available dataset for a given language.

To select a model trained on a specific dataset, pass the name of the package as a string object to the `package` argument.

To exemplify, the following command downloads a model for Finnish trained on the [*FinnTreeBank*](https://universaldependencies.org/treebanks/fi_ftb/index.html) dataset instead of the default model, which is trained on the [*Turku Dependency Treebank*](https://universaldependencies.org/treebanks/fi_tdt/index.html) dataset.

```python
stanza.download(lang='fi', package='ftb')
```

The package names are provided in [the list of language models](https://stanfordnlp.github.io/stanza/available_models.html) available for Stanza.

To install the language model into the permanent storage on [CSC Notebooks](https://notebooks.csc.fi/), we must also pass the optional `model_dir` argument that points towards a directory in the permanent storage. Without using the permanent storage, the models are deleted when the server is shut down.

Run the following cell to download the Stanza language model for Wolof into the directory `../stanza_models`.

Note that `..` moves up one step in the directory structure relative to this notebook, which places the models into the directory `notebooks`.

In [None]:
stanza.download(lang="wo", model_dir="../stanza_models")
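
If it is unclear where a relative path such as `../stanza_models` ends up, the standard library module `pathlib` can resolve it into an absolute path. The following is a minimal sketch, which assumes that the current working directory is the directory containing this notebook; the variable names are chosen for illustration only.

```python
from pathlib import Path

# Relative path used above; it is resolved against the current working directory
model_dir = Path("../stanza_models")

# Absolute location that the relative path points to
resolved = model_dir.resolve()
print(resolved)

# Prints True only if the directory already exists, e.g. after a download
print(resolved.is_dir())
```

Printing the resolved path before downloading is a quick way to confirm that the models will be stored in the intended directory.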

To load a Stanza language model, we must first create a *Pipeline* object by calling the `Pipeline` class available in the `stanza` module.

To exemplify the procedure, let's initialise a pipeline with a language model for Wolof.

To load a language model for Wolof into the pipeline, provide the string `wo` to the `lang` argument.

Because we did **not** download the language model into the default directory, we must also provide a string containing the path to the directory with Stanza language models to the `dir` argument.

In [None]:
# Use the Pipeline class to initialise a Stanza pipeline with a language model for Wolof, which
# is assigned to the variable 'nlp'.
nlp = stanza.Pipeline(lang='wo', dir="../stanza_models")

# Call the variable to examine the output
nlp

This returns a Stanza *Pipeline* object, which contains the language model and a number of *processors* that perform various natural language processing tasks.

Now that we have initialised a pipeline with a language model, we can feed some text in Wolof to the model under `nlp` as a string object and store the result under the variable `doc`.

In [None]:
# Feed text to the model under 'nlp'; store result under doc
doc = nlp("Réew maa ngi lebe turam wi ci dex gi ko peek ci penku ak bëj-gànnaar, te ab balluwaayam bawoo ca Fuuta Jallon ca Ginne, di Dexug Senegaal. Ab kilimaam bu gëwéel la te di bu fendi te yor ñaari jamono: jamonoy nawet (jamonoy taw) ak ju noor (jamonoy fendi).")

# Check the type of the output
type(doc)

This returns a Stanza [*Document*](https://stanfordnlp.github.io/stanza/data_objects.html#document) object, which contains the linguistic annotations created using the model.

To begin with, the attribute `sentences` of a *Document* object contains a list, in which each item stands for a sentence.

Let's use the brackets to access the first item `[0]` in the list.

In [None]:
# Get the first item in the list of sentences
doc.sentences[0]

Although the output contains both brackets `[]` and curly braces `{}`, which Python uses for lists and dictionaries, respectively, the output is not a list with nested dictionaries, but a Stanza [*Sentence*](https://stanfordnlp.github.io/stanza/data_objects.html#sentence) object.

In [None]:
# Check the type of the first item in the list of sentences
type(doc.sentences[0])

The *Sentence* object contains various attributes and methods for accessing the linguistic annotations.
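
As a rough sketch of how these attributes can be used, the function below loops over the `sentences` attribute of a *Document* and the `words` attribute of each *Sentence*, collecting the `text` and `lemma` of every word. The function name `word_lemmas` and the stand-in object `demo_doc` are made up for this illustration: the stand-in only mimics the layout of a Stanza *Document*, so that the code can be tried without a downloaded model, and its values are written by hand rather than produced by a model.

```python
from types import SimpleNamespace


def word_lemmas(doc):
    """Yield (text, lemma) pairs for every word in a Stanza-style Document."""
    for sentence in doc.sentences:
        for word in sentence.words:
            yield word.text, word.lemma


# Hand-written stand-in that mimics the Document -> Sentence -> Word layout.
# The values are illustrative and not actual model output.
demo_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(words=[SimpleNamespace(text="Réew", lemma="réew")])
    ]
)

# Collect the pairs into a list
pairs = list(word_lemmas(demo_doc))
print(pairs)
```

With a real pipeline, the same function could presumably be called on the `doc` object created above, as in `list(word_lemmas(doc))`.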

If we wish to interact with the annotations using data structures native to Python, we can use the `to_dict()` method to cast the annotations into a list of dictionaries, where each dictionary stands for a single Stanza [*Token*](https://stanfordnlp.github.io/stanza/data_objects.html#token) object.

The *key* and *value* pairs in these dictionaries contain the linguistic annotations for each *Token*.

In [None]:
# Cast the first Sentence object into a list of dictionaries; store under variable 'doc_dict'
doc_dict = doc.sentences[0].to_dict()

# Get the dictionary for the first Token
doc_dict[0]

We can retrieve specific annotations from the dictionary by placing their keys as string objects within brackets.

In [None]:
# Get the value under key 'lemma' for the first item [0] in the list 'doc_dict'
doc_dict[0]['lemma']

This provides us with the lemma of the word "réew", which stands for "country".
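
The same lookup can be applied to every dictionary in the list using a list comprehension. The sketch below uses a hand-written stand-in for the output of `to_dict()`, so the values are illustrative rather than actual model output. The second entry deliberately lacks the key `lemma` to show that the dictionary method `get()` returns `None` for a missing key instead of raising a `KeyError`.

```python
# Hand-written stand-in for the list returned by to_dict().
# The values are illustrative and not actual model output.
sentence_dicts = [
    {"id": 1, "text": "Réew", "lemma": "réew"},
    {"id": 2, "text": "maa"},  # no 'lemma' key, included to demonstrate get()
]

# Collect the lemma of each entry; get() returns None if the key is missing
lemmas = [token.get("lemma") for token in sentence_dicts]
print(lemmas)  # ['réew', None]
```

Using `get()` is a cautious choice when it is not certain that every dictionary carries the same set of keys.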

## Interfacing Stanza with spaCy

A Python library named [*spacy-stanza*](https://spacy.io/universe/project/spacy-stanza) allows using Stanza language models in spaCy, which considerably increases the number of pre-trained models available for languages other than English.

To use Stanza language models in spaCy, let's start by importing the *spacy-stanza* library.

In [None]:
import spacy_stanza