<a href="https://colab.research.google.com/github/TurkuNLP/Turku-neural-parser-pipeline/blob/master/docs/tnpp_diaparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turku Neural Parser Pipeline

* A mini-tutorial of the latest version of the parser pipeline
* Make sure to run it with GPU enabled (Runtime -> Change runtime type -> GPU)


# Modules

## Segmentation

* Tokenization and sentence segmentation happen jointly and are implemented using the UDPipe library
* Machine-learned sequence classification model

## PoS and morphological tagging

* A BERT-based classification model
* Joint prediction of PoS and Tags
* Implemented in PyTorch Lightning

## Dependency parsing

* Parsing is done using the [diaparser](https://github.com/Unipisa/diaparser) parser
* A BERT-based model, implemented in PyTorch

## Lemmatization

* Lemmatization is a sequence-to-sequence model
* Wordform + Tags -> Lemma
* Fully machine-learned
* Implemented using OpenNMT (a machine translation library)
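
To make the `Wordform + Tags -> Lemma` idea concrete, here is a minimal sketch of how a source sequence for a sequence-to-sequence lemmatizer could be built. The function name `lemmatizer_input` and the exact layout (characters first, then one symbol per tag) are assumptions made for illustration; the tutorial does not show the format the pipeline actually uses.

```python
def lemmatizer_input(wordform: str, upos: str, feats: str) -> str:
    """Build a hypothetical source sequence: characters, then tags.

    The layout is an illustrative assumption, not the pipeline's real format.
    """
    chars = list(wordform)                 # one symbol per character
    tags = [f"UPOS={upos}"]                # the PoS tag as a single symbol
    if feats and feats != "_":             # "_" means "no features" in CoNLL-U
        tags.extend(feats.split("|"))      # one symbol per morphological feature
    return " ".join(chars + tags)


print(lemmatizer_input("makkaraa", "NOUN", "Case=Par|Number=Sing"))
# m a k k a r a a UPOS=NOUN Case=Par Number=Sing
```

A translation-style model would then be trained to map such a sequence to the characters of the lemma, here `m a k k a r a`.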

## GPU

* Accuracy is far beyond that of previous versions of this pipeline
* The cost: computationally intensive deep neural network models
* Small tests and examples can run on CPU, but any non-trivial amount of text needs a GPU accelerator
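
Before parsing anything sizeable, it can be worth confirming that a GPU is actually visible. The sketch below assumes PyTorch is the library to ask (the pipeline's models are PyTorch-based); the helper name `gpu_available` is made up for this example.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if gpu_available():
    print("GPU found, larger inputs should be feasible")
else:
    print("No GPU found, expect only small examples to run in reasonable time")
```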

# INSTALL

* git clone the code
* cd to the directory
* and install all requirements
* this takes a while, because the parser depends on several large libraries

In [1]:
!git clone https://github.com/TurkuNLP/Turku-neural-parser-pipeline.git
%cd Turku-neural-parser-pipeline

Cloning into 'Turku-neural-parser-pipeline'...
remote: Enumerating objects: 1277, done.
remote: Counting objects: 100% (318/318), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 1277 (delta 188), reused 304 (delta 179), pack-reused 959
Receiving objects: 100% (1277/1277), 367.26 KiB | 4.32 MiB/s, done.
Resolving deltas: 100% (746/746), done.
/content/Turku-neural-parser-pipeline


# Google Colab -specific installation

* Let us install only what we need for Google Colab
* Import `pytorch_lightning` to avoid a problem later
* Normally, you would install using `requirements.txt`

In [2]:
!python3 -m pip install ufal.udpipe configargparse transformers "OpenNMT-py>=1.2.0" "git+https://github.com/TurkuNLP/diaparser.git@master" "pytorch_lightning<1.5.0" "torchmetrics<=0.7.3"

Collecting git+https://github.com/TurkuNLP/diaparser.git@master
  Cloning https://github.com/TurkuNLP/diaparser.git (to revision master) to /tmp/pip-req-build-gkx7df6o
  Running command git clone -q https://github.com/TurkuNLP/diaparser.git /tmp/pip-req-build-gkx7df6o
Collecting ufal.udpipe
  Downloading ufal.udpipe-1.2.0.3.tar.gz (304 kB)
Collecting configargparse
  Downloading ConfigArgParse-1.5.3-py3-none-any.whl (20 kB)
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting OpenNMT-py>=1.2.0
  Downloading OpenNMT_py-2.2.0-py3-none-any.whl (216 kB)
Collecting pytorch_lightning<1.5.0
  Downloading pytorch_lightning-1.4.9-py3-none-any.whl (925 kB)
Collecting torchmetrics<=0.7.3
  Down

In [3]:
!pip3 install "torchmetrics<=0.7.3"



In [4]:
import pytorch_lightning


# FETCH MODEL

* At present, only the Finnish (fi_tdt_dia) and English (en_ewt_dia) models are available for the most recent diaparser-based version of the pipeline
* Models documented here: http://turkunlp.org/Turku-neural-parser-pipeline/models.html
* Models for the remaining UD languages are in the works

In [5]:
!python3 fetch_models.py fi_tdt_dia

Downloading from fi_tdt_dia and unpacking


* Note: this might take a while, because the model is quite large (>1GB)
* The above command created the directory `models_fi_tdt_dia` with the model
* The file `models_fi_tdt_dia/pipelines.yaml` defines all the possible pipelines for the parser in this model
* The `parse_plaintext` pipeline is the correct choice in most situations

# PARSE IN PYTHON

* You need to load and start the pipeline of choice
* Like so:

In [6]:
from tnparser.pipeline import read_pipelines, Pipeline

# What pipelines do we have for the Finnish model?
available_pipelines=read_pipelines("models_fi_tdt_dia/pipelines.yaml")               # {pipeline_name -> its steps}
# This is a dictionary, its keys are the pipelines
print(list(available_pipelines.keys()))
# Instantiate one of the pipelines
p=Pipeline(available_pipelines["parse_plaintext"])    

['parse_plaintext', 'tag_plaintext', 'parse_sentlines', 'parse_wslines', 'parse_conllu', 'tokenize', 'parse_noisytext']


  "The `@auto_move_data` decorator is deprecated in v1.3 and will be removed in v1.5."
INFO:root:Loading model from /content/Turku-neural-parser-pipeline/models_fi_tdt_dia/Tagger/best.ckpt
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
Lemmatizer device: gpu / 0
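
The pipeline names printed above suggest one pipeline per input format. The sketch below picks a pipeline name based on the kind of input at hand. The mapping and the helper `choose_pipeline` are assumptions inferred from the names alone, so check `pipelines.yaml` before relying on them. The `available` dictionary is a stand-in for what `read_pipelines` returns.

```python
# Hypothetical mapping from input kind to pipeline name (inferred from the names)
PIPELINE_FOR_INPUT = {
    "plaintext": "parse_plaintext",          # raw running text
    "sentence_per_line": "parse_sentlines",  # assumed: one sentence per line
    "whitespace_tokenized": "parse_wslines", # assumed: pre-tokenized, one sentence per line
    "conllu": "parse_conllu",                # assumed: already segmented CoNLL-U input
}


def choose_pipeline(available, input_kind):
    """Return the pipeline name for input_kind, checking that the model has it."""
    name = PIPELINE_FOR_INPUT.get(input_kind)
    if name is None:
        raise ValueError(f"Unknown input kind: {input_kind!r}")
    if name not in available:
        raise KeyError(f"Pipeline {name!r} is not defined for this model")
    return name


# Stand-in for read_pipelines(...); the step lists are left empty here
available = {name: [] for name in [
    "parse_plaintext", "tag_plaintext", "parse_sentlines", "parse_wslines",
    "parse_conllu", "tokenize", "parse_noisytext",
]}
print(choose_pipeline(available, "conllu"))  # parse_conllu
```

With the real dictionary, the result can be used as `Pipeline(available_pipelines[name])`, exactly as in the cell above.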


In [7]:
txt_in="Minulla on söpö koira. Se haukkuu, syö makkaraa, jahtaa oravia ja tsillailee kanssani!"
parsed=p.parse(txt_in)
print(parsed)

# newdoc
# newpar
# sent_id = 1
# text = Minulla on söpö koira.
1	Minulla	minä	PRON	_	Case=Ade|Number=Sing|Person=1|PronType=Prs	0	root	_	_
2	on	olla	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	1	cop:own	_	_
3	söpö	söpö	ADJ	_	Case=Nom|Degree=Pos|Number=Sing	4	amod	_	_
4	koira	koira	NOUN	_	Case=Nom|Number=Sing	1	nsubj:cop	_	SpaceAfter=No
5	.	.	PUNCT	_	_	1	punct	_	_

# sent_id = 2
# text = Se haukkuu, syö makkaraa, jahtaa oravia ja tsillailee kanssani!
1	Se	se	PRON	_	Case=Nom|Number=Sing|PronType=Dem	2	nsubj	_	_
2	haukkuu	haukkua	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	SpaceAfter=No
3	,	,	PUNCT	_	_	4	punct	_	_
4	syö	syödä	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	conj	_	_
5	makkaraa	makkara	NOUN	_	Case=Par|Number=Sing	4	obj	_	SpaceAfter=No
6	,	,	PUNCT	_	_	7	punct	_	_
7	jahtaa	jahtaa	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	conj	_	_
8	oravia	orava	NOUN	_	Case=Pa
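
The parser returns CoNLL-U as one string: comment lines start with `#`, every token is a line of ten tab-separated columns, and a blank line ends a sentence. To work with the result in Python, a small reader is enough. This is a minimal sketch, not part of the pipeline; the names `FIELDS` and `read_conllu` are made up here, and the sample is a shortened copy of the output above.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Yield (comments, tokens) per sentence; each token is a dict keyed by FIELDS."""
    comments, tokens = [], []
    for line in text.splitlines():
        if not line.strip():               # blank line: sentence boundary
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):         # comment / metadata line
            comments.append(line)
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):   # skip malformed lines
                tokens.append(dict(zip(FIELDS, cols)))
    if tokens:                             # last sentence without trailing blank line
        yield comments, tokens


# Two tokens from the output above, rebuilt with explicit tabs
sample = "\n".join([
    "# sent_id = 1",
    "# text = Minulla on söpö koira.",
    "\t".join(["1", "Minulla", "minä", "PRON", "_",
               "Case=Ade|Number=Sing|Person=1|PronType=Prs", "0", "root", "_", "_"]),
    "\t".join(["2", "on", "olla", "AUX", "_",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
               "1", "cop:own", "_", "_"]),
    "",
])

for comments, tokens in read_conllu(sample):
    for tok in tokens:
        print(tok["form"], tok["lemma"], tok["upos"], tok["deprel"])
```

In the notebook, `read_conllu(parsed)` would iterate over the sentences of the actual parser output.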

# Parsing more data

* You might have many files with data you need to parse
* If you have massive documents, it makes sense to split them into manageable pieces
* Here is a basic example of how to achieve that
* You can download an example zip file I prepared from here: [http://dl.turkunlp.org/.ginter/news_test_data.zip](http://dl.turkunlp.org/.ginter/news_test_data.zip)
* Or simply upload your own


In [8]:
#Remember this notebook uses Turku-neural-parser-pipeline as its working directory
!wget http://dl.turkunlp.org/.ginter/news_test_data.zip
!unzip news_test_data.zip #will unzip 67 files into ./test_data

--2022-05-10 11:55:52--  http://dl.turkunlp.org/.ginter/news_test_data.zip
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 136098 (133K) [application/zip]
Saving to: ‘news_test_data.zip’


2022-05-10 11:55:53 (308 KB/s) - ‘news_test_data.zip’ saved [136098/136098]

Archive:  news_test_data.zip
   creating: test_data/
  inflating: test_data/yle_news_0061.txt  
  inflating: test_data/yle_news_0053.txt  
  inflating: test_data/yle_news_0052.txt  
  inflating: test_data/yle_news_0050.txt  
  inflating: test_data/yle_news_0017.txt  
  inflating: test_data/yle_news_0044.txt  
  inflating: test_data/yle_news_0001.txt  
  inflating: test_data/yle_news_0005.txt  
  inflating: test_data/yle_news_0009.txt  
  inflating: test_data/yle_news_0051.txt  
  inflating: test_data/yle_news_0029.txt  
  inflating: test_data/yle_news_0046.txt  
  inflating: test

* Now we have 67 text files in `test_data` and we would like to parse them

In [9]:
import glob #allows listing files
import tqdm #progress bar

all_files=glob.glob("test_data/*.txt") #list all files we need

for file_name in tqdm.tqdm(all_files):
    with open(file_name,encoding="utf-8") as f_in: #open the input file
        txt=f_in.read() #read the file
    parsed=p.parse(txt) #parse it
    with open(file_name.replace(".txt",".conllu"),"wt",encoding="utf-8") as f_out: #open output file
        f_out.write(parsed) #and write out the result

100%|██████████| 67/67 [01:18<00:00,  1.17s/it]


* There are now parsed CoNLL-U files under `test_data`
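
The news files in this example are small, so each one is parsed in a single call. For massive documents, one way to get manageable pieces is to split the text on blank lines and group paragraphs up to a size limit. The helper `split_into_chunks` and the default `max_chars` value are assumptions for illustration, not something the pipeline provides.

```python
def split_into_chunks(text, max_chars=20000):
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))   # close the current chunk
            current, size = [], 0
        current.append(para)
        size += len(para) + 2                     # +2 for the paragraph separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "\n\n".join(["a" * 10] * 5)                # five short paragraphs
chunks = split_into_chunks(text, max_chars=25)
print([len(c) for c in chunks])                   # [22, 22, 10]
```

Each chunk can then be passed to `p.parse` and the results written out one after another. Since every call is independent, metadata such as `sent_id` may restart in each chunk.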

In [10]:
# Basic stats of the parsed files
!echo "Sentences:" ; cat test_data/*.conllu | grep -Pc '^1\t'
!echo "Tokens:" ; cat test_data/*.conllu | grep -Pc '^[0-9]+\t'

Sentences:
2689
Tokens:
35681
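
The same counts can be computed in Python, which is handy when the numbers are needed later in the notebook. This sketch mirrors the two `grep` patterns: a token line starts with an integer ID followed by a tab, and a sentence is counted at each token with ID `1`. The function names are made up for this example.

```python
import glob


def conllu_stats(lines):
    """Return (sentences, tokens) using the same logic as the grep commands."""
    sentences = tokens = 0
    for line in lines:
        first = line.split("\t", 1)[0]
        # Only plain integer IDs followed by a tab count as tokens;
        # comments, blank lines and ranges such as "1-2" are skipped.
        if "\t" in line and first.isascii() and first.isdigit():
            tokens += 1
            if first == "1":
                sentences += 1
    return sentences, tokens


def stats_for_files(pattern):
    """Sum the counts over all files matching a glob pattern."""
    sentences = tokens = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            s, t = conllu_stats(f)
        sentences += s
        tokens += t
    return sentences, tokens


print(stats_for_files("test_data/*.conllu"))
```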


* Now we still need to pack and download the data

In [11]:
!zip parsed.zip test_data/*.conllu

  adding: test_data/yle_news_0000.conllu (deflated 81%)
  adding: test_data/yle_news_0001.conllu (deflated 75%)
  adding: test_data/yle_news_0002.conllu (deflated 82%)
  adding: test_data/yle_news_0003.conllu (deflated 73%)
  adding: test_data/yle_news_0004.conllu (deflated 81%)
  adding: test_data/yle_news_0005.conllu (deflated 80%)
  adding: test_data/yle_news_0006.conllu (deflated 80%)
  adding: test_data/yle_news_0007.conllu (deflated 79%)
  adding: test_data/yle_news_0008.conllu (deflated 81%)
  adding: test_data/yle_news_0009.conllu (deflated 79%)
  adding: test_data/yle_news_0010.conllu (deflated 78%)
  adding: test_data/yle_news_0011.conllu (deflated 81%)
  adding: test_data/yle_news_0012.conllu (deflated 81%)
  adding: test_data/yle_news_0013.conllu (deflated 80%)
  adding: test_data/yle_news_0014.conllu (deflated 81%)
  adding: test_data/yle_news_0015.conllu (deflated 80%)
  adding: test_data/yle_news_0016.conllu (deflated 78%)
  adding: test_data/yle_news_0017.conllu (deflat

...and download the `parsed.zip` file and you're good to go
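
Packing can also be done from Python with the standard `zipfile` module, for instance when the notebook should run end to end without shell commands. The helper `zip_conllu` below is a sketch written for this tutorial, not part of the pipeline.

```python
import glob
import os
import zipfile


def zip_conllu(src_dir, out_path):
    """Pack all .conllu files from src_dir into out_path; return how many were added."""
    paths = sorted(glob.glob(os.path.join(src_dir, "*.conllu")))
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path)                 # stored under its path, like the zip command
    return len(paths)


if os.path.isdir("test_data"):             # only run when the example data is present
    print(zip_conllu("test_data", "parsed.zip"), "files packed")
```

In Colab, `from google.colab import files` followed by `files.download("parsed.zip")` then starts the browser download; outside Colab that module is not available.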

# Models

* Universal Dependencies models
* A handful of specialized models (e.g. biomedical English)
* Training new models is not particularly difficult; documentation for training the diaparser-based pipeline is in the works

# Failure modes

* Generally this is a pretty stable parser; it has been used to parse some hundreds of millions of sentences successfully
* Most failures stem from the bleeding-edge libraries we are forced to use; these keep changing rapidly
* Backward-incompatible, breaking changes are very common
* The Google Colab environment is regularly upgraded to the newest versions of many common libraries, and this might break some dependencies

In case of failure:

* Runtime -> Factory reset runtime, try again
* Check that you are on a GPU runtime. Large files might still take long to parse, so split your data into more manageable pieces
* Ping Filip Ginter or Jenna Kanerva with as good a description of the problem as possible
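
Since most failures come from library version changes, a useful problem description includes the versions that are actually installed. The sketch below collects them with the standard library (`importlib.metadata`, Python 3.8 or newer). The package list is taken from the install command in this notebook, and the helper name `environment_report` is made up for this example.

```python
import platform
from importlib import metadata

# Distribution names as used in the pip install command above
PACKAGES = ["torch", "pytorch-lightning", "torchmetrics", "transformers",
            "OpenNMT-py", "diaparser", "ufal.udpipe"]


def environment_report(packages=PACKAGES):
    """Return {name: version} for the Python runtime and each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for name, version in environment_report().items():
    print(f"{name}: {version}")
```

Pasting this output together with the full error message makes a report much easier to act on.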
