# Working with languages beyond English

## Introduction

Part II introduced basic natural language processing tasks using examples in English.

As the current *lingua franca*, English is a high-resource language in natural language processing: compared to other languages, it has far more resources (e.g. training and evaluation data) and models trained to perform a wider range of tasks.

The imbalance in resources and research effort has led to a situation where the advances in processing the English language are occasionally claimed to hold for natural language in general. 

However, as Bender ([2019](https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/)) has pointed out, *English is not a synonym for natural language*: even if one demonstrates that computers can achieve or surpass human-level performance in some natural language processing task for the English language, this does not mean that one has solved this task or problem for *natural language as a whole*.

To measure progress in natural language processing and to ensure that as many languages as possible can benefit from advances in the field, it is highly desirable to conduct research on natural language processing for typologically different languages across the world. 

## Extending spaCy through Stanza

[Stanza](https://stanfordnlp.github.io/stanza/) is a Python library for natural language processing that currently provides pre-trained language models for [66 languages](https://stanfordnlp.github.io/stanza/available_models.html) (Qi et al. [2020](https://www.aclweb.org/anthology/2020.acl-demos.14/)).

Stanza builds on the [Universal Dependencies](https://universaldependencies.org/) formalism, so its models can perform tasks such as tokenization, part-of-speech tagging, morphological tagging and dependency parsing, all of which were introduced in [Part II](../part_ii/03_basic_nlp.ipynb) using the spaCy library.

The spaCy library provides an extension named [*spacy-stanza*](https://spacy.io/universe/project/spacy-stanza) that allows using Stanza language models in spaCy, which considerably increases the number of pre-trained models available for languages other than English.

To use Stanza language models in spaCy, let's begin by importing both NLP libraries and the `StanzaLanguage` class from the *spacy-stanza* library. Note that the `StanzaLanguage` class is provided by *spacy-stanza* versions earlier than 1.0.

In [None]:
# Import libraries
import spacy
import stanza

# Import StanzaLanguage class from the spacy_stanza library
from spacy_stanza import StanzaLanguage

First, we must download the Stanza language model we wish to use by calling the `download()` function.

The `download()` function takes the argument `lang`, which defines the language of the model to be downloaded.

To download a model for a given language, look up its language code (e.g. *en*) in [the list of available language models](https://stanfordnlp.github.io/stanza/available_models.html) and pass the code as a string object to the `lang` argument. Most codes consist of two letters.

To store the downloaded language model in permanent storage on [CSC Notebooks](https://notebooks.csc.fi/), we must also pass the optional `dir` argument, which points to a directory in permanent storage. More recent Stanza releases name this argument `model_dir`.

Run the following cell to download the Stanza language model for English.

In [None]:
stanza.download(lang="en", dir="~/stanza_models")
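Whether Stanza expands the `~` shorthand in the directory path may depend on the version in use, so a cautious option is to resolve the path explicitly before passing it on. The following is a minimal sketch using only the standard library. The variable names `model_dir` and `model_dir_str` are illustrative and not part of the Stanza API.

```python
from pathlib import Path

# Resolve the '~' shorthand into the full path of the home directory.
# The directory name matches the one used in the cell above.
model_dir = Path("~/stanza_models").expanduser()

# Convert the path to a string, which can then be passed to Stanza
# in place of the literal "~/stanza_models".
model_dir_str = str(model_dir)

# Print the resolved path to check where the models will be stored
print(model_dir_str)
```

The resulting string can be passed to both `stanza.download()` and `stanza.Pipeline()`, so that the two calls are guaranteed to point to the same directory.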

By default, Stanza downloads the model trained on the largest available dataset for a given language.

If you wish to use an alternative model, you must pass the optional argument `package` to the `download()` function and provide the package name as a string object, e.g. `stanza.download(lang="fi", package="ftb")`. The package names are provided in [the list of language models](https://stanfordnlp.github.io/stanza/available_models.html) available for Stanza.

Next, we initialise a Stanza language model using the `Pipeline` class.

To exemplify the procedure, let's load a model for the English language.

Models for specific languages are loaded by providing the language code as a string to the `lang` argument.

If the Stanza models have been downloaded into a directory other than the default one, a path to this directory must be provided as a string to the `dir` argument.

In [None]:
# Call the Pipeline class to initialise a Stanza
# pipeline with a language model for English, which
# is assigned to the variable 'stanza_model'.
stanza_model = stanza.Pipeline(lang="en", 
                               dir="~/stanza_models")

# Call the variable to examine the output
stanza_model

This returns a Stanza *Pipeline* object, which contains the language model and a number of components performing various natural language processing tasks.
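If only some of these components are needed, the `Pipeline` class also accepts a `processors` argument, which takes the names of the desired components as a single comma-separated string. The sketch below builds such a string from a list. The variable names `wanted` and `processors` are illustrative, and the call to `Pipeline` is left as a comment, because it requires the downloaded model.

```python
# Names of the Stanza processors to include in the pipeline:
# tokenization, part-of-speech tagging, lemmatization and
# dependency parsing.
wanted = ["tokenize", "pos", "lemma", "depparse"]

# Stanza expects the names as one comma-separated string
processors = ",".join(wanted)

# Print the resulting string
print(processors)

# In the notebook, the string could then be passed on as follows:
# stanza_model = stanza.Pipeline(lang="en",
#                                processors=processors,
#                                dir="~/stanza_models")
```

Leaving out components that are not needed can shorten both loading and processing times.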

In [None]:
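# Wrap the Stanza pipeline in a StanzaLanguage object, so that
# it can be used like a spaCy language model. The result is
# assigned to the variable 'nlp'.
nlp = StanzaLanguage(stanza_model)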
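The resulting object can be called on a string just like the spaCy language models introduced in Part II, and the tokens in the returned *Doc* object carry the familiar attributes such as `text`, `lemma_`, `pos_` and `dep_`. The sketch below shows one way to collect these attributes. The function `token_table()` is a hypothetical helper, and the stand-in tokens with hand-written values are used only so that the example runs without downloading a model.

```python
from types import SimpleNamespace


def token_table(doc):
    """Collect the text, lemma, part-of-speech tag and dependency
    label of each token in a spaCy-style Doc into a list of tuples."""
    return [(token.text, token.lemma_, token.pos_, token.dep_)
            for token in doc]


# Stand-in tokens with hand-written, illustrative values; these are
# not actual model output. In the notebook, pass the output of
# nlp("...") to token_table() instead.
mock_doc = [
    SimpleNamespace(text="Stanza", lemma_="Stanza", pos_="PROPN", dep_="nsubj"),
    SimpleNamespace(text="works", lemma_="work", pos_="VERB", dep_="root"),
]

# Loop over the collected rows and print them out
for row in token_table(mock_doc):
    print(row)
```

In the notebook, a call such as `token_table(nlp("Stanza works with spaCy."))` would return the same kind of list, with values predicted by the Stanza language model.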