<a href="https://colab.research.google.com/github/TurkuNLP/Turku-neural-parser-pipeline/blob/modularize/turku_neural_parser_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turku Neural Parser Pipeline - Python module version on Google Colab

* This is a basic tutorial for running the parser pipeline under Google Colab.
* It makes it possible for anyone to run the parser with GPU acceleration.

* This notebook downloads and uses the Finnish model `models_fi_tdt_v2.7`. To run it with another model, change the model name in the `Download and unpack the model` and `Running the parser` sections.


## Table of contents

1. Install
2. Download and unpack the model
3. Running the parser
4. Process the output
5. Citations


# Install

* Install the pre-built wheel (this takes a while)

`pip3 install http://dl.turkunlp.org/turku-parser-models/turku_neural_parser-0.3-py3-none-any.whl`

In [1]:
!wget -nc http://dl.turkunlp.org/turku-parser-models/turku_neural_parser-0.3-py3-none-any.whl
!pip3 install turku_neural_parser-0.3-py3-none-any.whl

--2020-12-14 10:20:00--  http://dl.turkunlp.org/turku-parser-models/turku_neural_parser-0.3-py3-none-any.whl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99697 (97K) [application/octet-stream]
Saving to: ‘turku_neural_parser-0.3-py3-none-any.whl’


2020-12-14 10:20:01 (293 KB/s) - ‘turku_neural_parser-0.3-py3-none-any.whl’ saved [99697/99697]

Processing ./turku_neural_parser-0.3-py3-none-any.whl
Collecting OpenNMT-py>=1.2.0
  Downloading https://files.pythonhosted.org/packages/9f/20/40f8b722aa0e35e259c144b6ec2d684f1aea7de869cf586c67cfd6fe1c55/OpenNMT_py-1.2.0-py3-none-any.whl (195kB)
     |████████████████████████████████| 204kB 9.1MB/s 
Collecting ufal.udpipe
  Downloading https://files.pythonhosted.org/packages/e5/72/2b8b9dc7c80017c790bb3308bbad34b57accfed2ac2f1f4ab252ff4e9cb2/ufal.udpipe-1.2.0.3.tar.gz (304kB)

## Prerequisites:

* The models here were tested with torch 1.7
* A newer torch release may at some point break this notebook
* If that happens, try installing torch 1.7
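
A quick way to see whether the installed torch matches the tested version is sketched below. The helper names `major_minor` and `matches_tested` are made up for this illustration and are not part of the parser; the check only compares the major and minor version numbers and reports whether CUDA is visible.

```python
import importlib.util

TESTED_TORCH = (1, 7)  # the version this notebook says the models were tested with


def major_minor(version):
    """Turn a version string such as '1.7.0+cu101' into the tuple (1, 7)."""
    core = version.split("+")[0]  # drop a local build suffix like '+cu101'
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def matches_tested(version, tested=TESTED_TORCH):
    """True if the major.minor part of `version` equals the tested version."""
    return major_minor(version) == tested


# Only import torch if it is installed, so the check itself never crashes.
if importlib.util.find_spec("torch") is not None:
    import torch

    print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
    if not matches_tested(torch.__version__):
        print("Not the tested torch 1.7; if the parser fails, try installing torch 1.7")
else:
    print("torch is not installed")
```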


In [2]:
!nvcc --version
!python -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
Python 3.6.9


# Download and unpack the model

* Available models are listed here: http://dl.turkunlp.org/turku-parser-models/

* Download the model and unpack it

`wget http://dl.turkunlp.org/turku-parser-models/models_fi_tdt_v2.7.tar.gz ; tar zxvf models_fi_tdt_v2.7.tar.gz`

...and you are good to go!

In [3]:
!wget -nc http://dl.turkunlp.org/turku-parser-models/models_fi_tdt_v2.7.tar.gz
!tar zxvf models_fi_tdt_v2.7.tar.gz

--2020-12-14 10:22:59--  http://dl.turkunlp.org/turku-parser-models/models_fi_tdt_v2.7.tar.gz
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 590212039 (563M) [application/octet-stream]
Saving to: ‘models_fi_tdt_v2.7.tar.gz’


2020-12-14 10:23:34 (16.3 MB/s) - ‘models_fi_tdt_v2.7.tar.gz’ saved [590212039/590212039]

models_fi_tdt_v2.7/
models_fi_tdt_v2.7/pipelines.yaml
models_fi_tdt_v2.7/Tokenizer/
models_fi_tdt_v2.7/Tokenizer/tokenizer.udpipe
models_fi_tdt_v2.7/Lemmatizer/
models_fi_tdt_v2.7/Lemmatizer/big_lemma_cache.tsv
models_fi_tdt_v2.7/Lemmatizer/lemma_cache.tsv
models_fi_tdt_v2.7/Lemmatizer/lemmatizer.pt
models_fi_tdt_v2.7/Udify/
models_fi_tdt_v2.7/Udify/model.tar.gz
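
If you plan to switch models, it can help to derive the download URL and the `pipelines.yaml` path from a single model name so the two places stay in sync. The sketch below assumes that other models follow the same naming pattern as `models_fi_tdt_v2.7` (archive `<name>.tar.gz` unpacking into a directory `<name>/`); that is an assumption, so check the model listing before relying on it. `model_paths` is a hypothetical helper, not part of the parser.

```python
BASE_URL = "http://dl.turkunlp.org/turku-parser-models/"


def model_paths(model_name):
    """Return the download URL, archive file name and pipelines.yaml path for a model.

    Assumes the archive is called <model_name>.tar.gz and unpacks into <model_name>/,
    as is the case for models_fi_tdt_v2.7 in this notebook.
    """
    archive = model_name + ".tar.gz"
    return {
        "url": BASE_URL + archive,
        "archive": archive,
        "pipelines": model_name + "/pipelines.yaml",
    }


paths = model_paths("models_fi_tdt_v2.7")
print(paths["url"])        # what to pass to wget
print(paths["pipelines"])  # what to pass to read_pipelines
```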


# Running the parser

* Each model can define several processing pipelines
* These are listed in `modeldir/pipelines.yaml`
* `parse_plaintext` is the default
<br/><br/>
* `parse_plaintext`: reads plain text, tokenizes, splits into sentences, tags, parses, lemmatizes
* `parse_sentlines`: reads text with one sentence per line, tokenizes, tags, parses, lemmatizes
* `parse_wslines`: reads whitespace-tokenized text with one sentence per line, tags, parses, lemmatizes
* `parse_conllu`: reads CoNLL-U, wipes existing values from all columns, tags, parses, lemmatizes
* `tokenize`: reads plain text, tokenizes, splits into sentences
* `parse_noisytext`: meant for noisy plain-text input (e.g. web-crawled data); works like `parse_plaintext` but truncates long sentences and tokens to avoid out-of-memory (OOM) issues
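
As an illustration of the input that `parse_wslines` is described as reading, the sketch below turns already tokenized sentences into whitespace-separated text with one sentence per line. `to_wslines` is a hypothetical helper written for this example. Passing the result to `Pipeline.parse` as a single string, the way the plain-text example in this notebook does, is an assumption.

```python
def to_wslines(sentences):
    """Join pre-tokenized sentences into one-sentence-per-line, whitespace-separated text.

    sentences: a list of sentences, each a list of token strings.
    """
    lines = []
    for tokens in sentences:
        for tok in tokens:
            # A token that is empty or contains whitespace would change the tokenization.
            if len(tok.split()) != 1:
                raise ValueError("token must be non-empty and free of whitespace: %r" % tok)
        lines.append(" ".join(tokens))
    return "\n".join(lines)


text = to_wslines([
    ["Minulla", "on", "ruskea", "koira", "!"],
    ["Voi", "että", "!"],
])
print(text)

# Assumed usage, following the Pipeline calls shown in this notebook:
# p = Pipeline(available_pipelines["parse_wslines"])
# parsed = p.parse(text)
```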


In [4]:
from tnparser.pipeline import read_pipelines, Pipeline

# print available pipelines for your model
available_pipelines=read_pipelines("models_fi_tdt_v2.7/pipelines.yaml") # insert your model name here (model-name/pipelines.yaml)
print(list(available_pipelines.keys()))


['parse_plaintext', 'parse_sentlines', 'parse_wslines', 'parse_conllu', 'tokenize', 'parse_noisytext']


In [5]:
# select the pipeline that fits your input data and load the model
# the first run takes a while because the model has to be loaded

p=Pipeline(available_pipelines["parse_plaintext"])
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että!") # insert your text here








Dataset reader: <class 'tnparser.udify.dataset_readers.universal_dependencies.UniversalDependenciesDatasetReader'>
0it [00:00, ?it/s]Your label namespace was 'upos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'xpos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'feats'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'lemmas'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your voca

In [6]:
print(parsed)

# newdoc
# newpar
# sent_id = 1
# text = Minulla on ruskea koira!
1	Minulla	minä	PRON	_	Case=Ade|Number=Sing|Person=1|PronType=Prs	0	root	_	_
2	on	olla	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	1	cop:own	_	_
3	ruskea	ruskea	ADJ	_	Case=Nom|Degree=Pos|Number=Sing	4	amod	_	_
4	koira	koira	NOUN	_	Case=Nom|Number=Sing	1	nsubj:cop	_	_
5	!	!	PUNCT	_	_	1	punct	_	_

# sent_id = 2
# text = Se haukkuu ja juoksee.
1	Se	se	PRON	_	Case=Nom|Number=Sing|PronType=Dem	2	nsubj	_	_
2	haukkuu	haukkua	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	_
3	ja	ja	CCONJ	_	_	4	cc	_	_
4	juoksee	juosta	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	conj	_	_
5	.	.	PUNCT	_	_	2	punct	_	_

# sent_id = 3
# text = Voi että!
1	Voi	voi	INTJ	_	_	2	discourse	_	_
2	että	että	INTJ	_	_	0	root	_	_
3	!	!	PUNCT	_	_	2	punct	_	_




# GPU mode

* By default, the pipeline runs in CPU mode
* It has to be told explicitly to run on the GPU
* This is a bit tricky right now, but not impossible
* Note: if you switch the Colab runtime to GPU at this point, you need to re-run the pip install, because changing the runtime type starts a fresh environment


In [7]:
# Admittedly a workaround: the option names contain a dot, so they are set via __dict__
import types
extra_args=types.SimpleNamespace()
extra_args.__dict__["udify_mod.device"]="0" # simulates someone giving a --device 0 parameter to Udify
extra_args.__dict__["lemmatizer_mod.device"]="0" # same for the lemmatizer

p=Pipeline(available_pipelines["parse_plaintext"],extra_args)
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että!")
print("Parsed has this many lines:",len(parsed.split("\n")))


Dataset reader: <class 'tnparser.udify.dataset_readers.universal_dependencies.UniversalDependenciesDatasetReader'>
0it [00:00, ?it/s]Your label namespace was 'upos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'xpos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'feats'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'lemmas'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your voca

Parsed has this many lines: 25


In [8]:
# Since we are on a GPU, we can push through quite a bit more data
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että! "*200) # takes forever on CPU, finishes in a few seconds on GPU
print("Parsed has this many lines:",len(parsed.split("\n")))

Parsed has this many lines: 4403
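
The two `__dict__` assignments in the GPU cell can be wrapped in a small helper, sketched below. `make_device_args` is a hypothetical name; the option keys `udify_mod.device` and `lemmatizer_mod.device` are the ones used in this notebook, and `setattr` is used because attribute names containing a dot cannot be written with ordinary attribute syntax.

```python
import types


def make_device_args(device="0", modules=("udify_mod", "lemmatizer_mod")):
    """Build the extra-arguments namespace that selects a device for each module."""
    args = types.SimpleNamespace()
    for module in modules:
        # Equivalent to args.__dict__["<module>.device"] = "<device>"
        setattr(args, module + ".device", str(device))
    return args


extra_args = make_device_args("0")
print(vars(extra_args))

# Assumed usage, mirroring the GPU cell in this notebook:
# p = Pipeline(available_pipelines["parse_plaintext"], extra_args)
```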


# Process the output

* The output of the pipeline run is a CoNLL-U string
* You can parse it in any number of ways
* This is my preferred way:

In [9]:
ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10) #the 10 columns

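def read_conll(inp,max_sent=0,drop_tokens=True,drop_nulls=True):
    """
    inp: list of lines or an open file
    max_sent: 0 for all, >0 to limit the number of sentences
    drop_tokens: ignore multiword token lines (ID is a range such as 1-2)
    drop_nulls: ignore null nodes in enhanced dependencies (ID is a decimal such as 1.1)

    Yields (sent, comments) pairs: sent is a list of token lines, each split
    into its 10 columns, and comments is the list of the sentence's comment lines
    """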

    comments=[]
    sent=[]
    yielded=0
    for line in inp:
        line=line.rstrip("\n")
        if line.startswith("#"):
            comments.append(line)
        elif not line:
            if sent:
                yield sent,comments
                yielded+=1
                if max_sent>0 and yielded==max_sent:
                    break
                sent,comments=[],[]
        else:
            cols=line.split("\t")
            if drop_tokens and "-" in cols[ID]:
                continue
            if drop_nulls and "." in cols[ID]:
                continue
            sent.append(cols)
    else:
        if sent:
            yield sent,comments

for one_sent,comments in read_conll(parsed.split("\n"),5):
    words=(word_line[FORM] for word_line in one_sent)
    lemmas=(word_line[LEMMA] for word_line in one_sent)
    print(" ".join(words))
    print(" ".join(lemmas))
    print()

# and that's really all there is to it :)


Minulla on ruskea koira !
minä olla ruskea koira !

Se haukkuu ja juoksee .
se haukkua ja juosta .

Voi että !
voi että !

Minulla on ruskea koira !
minä olla ruskea koira !

Se haukkuu ja juoksee .
se haukkua ja juosta .
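
The `HEAD` and `DEPREL` columns can be read in the same way to recover the dependency structure. The self-contained sketch below uses the second sentence from the parser output shown earlier in this notebook and lists (head word, relation, dependent word) triples. `dependency_triples` is a hypothetical helper written for this example, and it only looks at the basic dependencies.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)  # the 10 columns

# One sentence copied from the parser output shown in this notebook
sample = (
    "# sent_id = 2\n"
    "# text = Se haukkuu ja juoksee.\n"
    "1\tSe\tse\tPRON\t_\tCase=Nom|Number=Sing|PronType=Dem\t2\tnsubj\t_\t_\n"
    "2\thaukkuu\thaukkua\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n"
    "3\tja\tja\tCCONJ\t_\t_\t4\tcc\t_\t_\n"
    "4\tjuoksee\tjuosta\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t2\tconj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def dependency_triples(conllu):
    """Yield (head_form, deprel, dependent_form) for every token, sentence by sentence."""
    sent = []
    # The extra "" guarantees the last sentence is flushed even without a trailing blank line.
    for line in conllu.split("\n") + [""]:
        if line.startswith("#"):
            continue  # comment line
        if line:
            cols = line.split("\t")
            if cols[ID].isdigit():  # skips multiword token ranges (1-2) and null nodes (1.1)
                sent.append(cols)
            continue
        # Blank line: sentence boundary. Map token IDs to word forms, then emit triples.
        forms = {cols[ID]: cols[FORM] for cols in sent}
        for cols in sent:
            yield forms.get(cols[HEAD], "ROOT"), cols[DEPREL], cols[FORM]
        sent = []


triples = list(dependency_triples(sample))
for head, deprel, dependent in triples:
    print(head, deprel, dependent, sep="\t")
```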



# Citations

The main reference is currently under preparation. For now, the papers that best describe the Turku-neural-parser are:

Turku-Neural-Parser-Pipeline (pre-BERT version):
```
@inproceedings{udst:turkunlp,
author = {Jenna Kanerva and Filip Ginter and Niko Miekka and Akseli Leino and Tapio Salakoski},
title = {Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task},
booktitle = {Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium",
year={2018}
}
```

Lemmatizer:
```
@article{kanerva2020lemmatizer,
title={Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks},
author={Kanerva, Jenna and Ginter, Filip and Salakoski, Tapio},
year={2020},
journal={Natural Language Engineering},
publisher={Cambridge University Press},
DOI={10.1017/S1351324920000224},
pages={1--30},
url={http://dx.doi.org/10.1017/S1351324920000224}
}
```

Turku-Enhanced-Parser-Pipeline (BERT version + enhanced dependencies):
```
@inproceedings{kanerva-etal-2020-turku,
    title = "{T}urku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the {IWPT} 2020 Shared Task",
    author = "Kanerva, Jenna  and
      Ginter, Filip  and
      Pyysalo, Sampo",
    booktitle = "Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.iwpt-1.17",
    doi = "10.18653/v1/2020.iwpt-1.17",
    pages = "162--173"
}
```

Consider also citing relevant software used in the pipeline:
* Udify: https://github.com/Hyperparticle/udify
* UDPipe v1: http://ufal.mff.cuni.cz/udpipe/1