[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLARIN-PL/NlpRest2-Tutorials/blob/master/part1.ipynb)

# Part 1 — Introduction to the CLARIN-PL web services

## 1. Basic characteristics

* REST model,
* GET/POST communication,
* synchronous (for short texts and fast tasks) and asynchronous (for time-consuming processing),
* LPMN: a notation for defining the processing pipeline (http://nlp.pwr.wroc.pl/redmine/projects/nlprest2/wiki/Tools)
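
Judging from the pipelines used in this tutorial, an LPMN string is a sequence of tool names joined with `|`, where a tool may take JSON options in parentheses. The helper below is a minimal sketch of composing such strings; `build_lpmn` is a hypothetical name, not part of the service API, and it covers only the `tool` and `tool({...})` forms.

```python
import json


def build_lpmn(steps):
    """Join pipeline steps into an LPMN string.

    Each step is either a tool name (str) or a (name, options) pair,
    where options is a dict serialized as compact JSON.
    """
    parts = []
    for step in steps:
        if isinstance(step, str):
            parts.append(step)
        else:
            name, options = step
            # Compact separators give the style seen in 'liner2({"model":"top9"})'
            encoded = json.dumps(options, separators=(",", ":"))
            parts.append("%s(%s)" % (name, encoded))
    return "|".join(parts)


print(build_lpmn(["wcrft2"]))
print(build_lpmn(["wcrft2", ("liner2", {"model": "top9"})]))
```

The resulting string can be passed as the `lpmn` field of the request payload.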

## 2. The simplest use case

Process a short sentence using a synchronous POST request.

In [1]:
import json
import requests

clarinpl_url = "http://ws.clarin-pl.eu/nlprest2/base"
user_mail = "demo2019@nlpday.pl"

In [2]:
url = clarinpl_url + "/process"
lpmn = "wcrft2"
text = "Na płocie siedzi kot."

payload = {'text': text, 'lpmn': lpmn, 'user': user_mail}
headers = {'content-type': 'application/json'}

In [3]:
r = requests.post(url, data=json.dumps(payload), headers=headers)
ccl = r.content.decode('utf-8')
print(ccl)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>Na</orth>
    <lex disamb="1"><base>na</base><ctag>prep:acc</ctag></lex>
   </tok>
   <tok>
    <orth>płocie</orth>
    <lex disamb="1"><base>płot</base><ctag>subst:sg:loc:m3</ctag></lex>
   </tok>
   <tok>
    <orth>siedzi</orth>
    <lex disamb="1"><base>siedzieć</base><ctag>fin:sg:ter:imperf</ctag></lex>
   </tok>
   <tok>
    <orth>kot</orth>
    <lex disamb="1"><base>kot</base><ctag>subst:sg:nom:m1</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>.</orth>
    <lex disamb="1"><base>.</base><ctag>interp</ctag></lex>
   </tok>
  </sentence>
 </chunk>
</chunkList>



### Print a list of token text forms

In [4]:
import xml.etree.ElementTree as ET

def ccl_orths(ccl):
    tree = ET.fromstring(ccl)
    return [orth.text for orth in tree.iter('orth')]

orths = ccl_orths(ccl)

print(orths)

['Na', 'płocie', 'siedzi', 'kot', '.']
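
The `<ns/>` element in the CCL output appears to mark that the next token follows without a space (here, the full stop after "kot"). Assuming that reading, the sketch below rebuilds the sentence text from the tokens. `ccl_sentences` and `SAMPLE_CCL` are hypothetical names; the sample is a shortened copy of the output above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tagger output, kept inline so the example runs offline
SAMPLE_CCL = """<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok><orth>Na</orth></tok>
   <tok><orth>płocie</orth></tok>
   <tok><orth>siedzi</orth></tok>
   <tok><orth>kot</orth></tok>
   <ns/>
   <tok><orth>.</orth></tok>
  </sentence>
 </chunk>
</chunkList>"""


def ccl_sentences(ccl):
    """Return the text of each sentence, honouring <ns/> (no space) markers."""
    root = ET.fromstring(ccl)
    sentences = []
    for sentence in root.iter('sentence'):
        text = ""
        glue = True  # no space before the first token
        for child in sentence:
            if child.tag == 'ns':
                glue = True
            elif child.tag == 'tok':
                orth = child.findtext('orth', default='')
                text += orth if glue else " " + orth
                glue = False
        sentences.append(text)
    return sentences


print(ccl_sentences(SAMPLE_CCL))
```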


### Print a list of token bases

In [5]:
def ccl_bases(ccl):
    tree = ET.fromstring(ccl)
    return [tok.find('./lex/base').text for tok in tree.iter('tok')]

bases = ccl_bases(ccl)
    
print(bases)

['na', 'płot', 'siedzieć', 'kot', '.']


### Print a list of token part of speech tags

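The part of speech is the first colon-separated field of each tag. For the meaning of the tags, see http://nkjp.pl/poliqarp/help/ense2.html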

In [6]:
def ccl_poses(ccl):
    tree = ET.fromstring(ccl)
    return [tok.find('./lex/ctag').text.split(":")[0] for tok in tree.iter('tok')]

poses = ccl_poses(ccl)

print(poses)

['prep', 'subst', 'fin', 'subst', 'interp']
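
The three helpers above each parse the document separately. As a sketch, the same information can be collected in one pass as `(orth, base, pos)` tuples. `ccl_tokens` is a hypothetical name; preferring the `lex` element with `disamb="1"` is an assumption that a token may carry more than one `lex`, with a fallback to the first one.

```python
import xml.etree.ElementTree as ET

# Two tokens copied from the output above, kept inline so the example runs offline
SAMPLE_CCL = """<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>Na</orth>
    <lex disamb="1"><base>na</base><ctag>prep:acc</ctag></lex>
   </tok>
   <tok>
    <orth>płocie</orth>
    <lex disamb="1"><base>płot</base><ctag>subst:sg:loc:m3</ctag></lex>
   </tok>
  </sentence>
 </chunk>
</chunkList>"""


def ccl_tokens(ccl):
    """Return a list of (orth, base, part-of-speech) tuples."""
    root = ET.fromstring(ccl)
    tokens = []
    for tok in root.iter('tok'):
        # Prefer the disambiguated interpretation, else take the first one
        lex = tok.find("./lex[@disamb='1']")
        if lex is None:
            lex = tok.find('./lex')
        orth = tok.findtext('orth', default='')
        base = lex.findtext('base', default='')
        pos = lex.findtext('ctag', default='').split(":")[0]
        tokens.append((orth, base, pos))
    return tokens


print(ccl_tokens(SAMPLE_CCL))
```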


### Tag and recognize named entities (boundaries)

In [7]:
url = clarinpl_url + "/process"
#lpmn = 'wcrft2'
lpmn = "wcrft2|liner2"
text = "Tony Halik przyszedł na świat w Toruniu"

payload = {'text': text, 'lpmn': lpmn, 'user': user_mail}
headers = {'content-type': 'application/json'}

In [8]:
r = requests.post(url, data=json.dumps(payload), headers=headers)
print(r.content.decode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk type="p" id="ch1">
  <sentence id="s1">
   <tok>
    <orth>Tony</orth>
    <lex disamb="1"><base>ton</base><ctag>subst:pl:nom:m3</ctag></lex>
    <ann chan="nam">1</ann>
   </tok>
   <tok>
    <orth>Halik</orth>
    <lex disamb="1"><base>Halik</base><ctag>ign</ctag></lex>
    <ann chan="nam">1</ann>
   </tok>
   <tok>
    <orth>przyszedł</orth>
    <lex disamb="1"><base>przyjść</base><ctag>praet:sg:m1:perf</ctag></lex>
    <ann chan="nam">0</ann>
   </tok>
   <tok>
    <orth>na</orth>
    <lex disamb="1"><base>na</base><ctag>prep:acc</ctag></lex>
    <ann chan="nam">0</ann>
   </tok>
   <tok>
    <orth>świat</orth>
    <lex disamb="1"><base>świat</base><ctag>subst:sg:nom:m3</ctag></lex>
    <ann chan="nam">0</ann>
   </tok>
   <tok>
    <orth>w</orth>
    <lex disamb="1"><base>w</base><ctag>prep:acc:nwok</ctag></lex>
    <ann chan="nam">0</ann>
   </tok>
   <tok>
    <orth>Toruniu</orth>
  

### Tag and recognize named entities (coarse-grained categories)

In [9]:
url = clarinpl_url + "/process"
#lpmn = 'wcrft2|liner2'
lpmn = 'wcrft2|liner2({"model":"top9"})'
text = "Tony Halik przyszedł na świat w Toruniu"

payload = {'text': text, 'lpmn': lpmn, 'user': user_mail}
headers = {'content-type': 'application/json'}

In [10]:
r = requests.post(url, data=json.dumps(payload), headers=headers)
print(r.content.decode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk type="p" id="ch1">
  <sentence id="s1">
   <tok>
    <orth>Tony</orth>
    <lex disamb="1"><base>ton</base><ctag>subst:pl:nom:m3</ctag></lex>
    <ann chan="nam_liv">1</ann>
    <ann chan="nam_loc">0</ann>
   </tok>
   <tok>
    <orth>Halik</orth>
    <lex disamb="1"><base>Halik</base><ctag>ign</ctag></lex>
    <ann chan="nam_liv">1</ann>
    <ann chan="nam_loc">0</ann>
   </tok>
   <tok>
    <orth>przyszedł</orth>
    <lex disamb="1"><base>przyjść</base><ctag>praet:sg:m1:perf</ctag></lex>
    <ann chan="nam_liv">0</ann>
    <ann chan="nam_loc">0</ann>
   </tok>
   <tok>
    <orth>na</orth>
    <lex disamb="1"><base>na</base><ctag>prep:acc</ctag></lex>
    <ann chan="nam_liv">0</ann>
    <ann chan="nam_loc">0</ann>
   </tok>
   <tok>
    <orth>świat</orth>
    <lex disamb="1"><base>świat</base><ctag>subst:sg:nom:m3</ctag></lex>
    <ann chan="nam_liv">0</ann>
    <ann chan="nam_loc">0</ann>
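
The `<ann>` elements can be turned into a list of entities. The sketch below assumes that, within a sentence, tokens sharing the same non-zero number in a channel belong to one annotation and that `0` means "not annotated". `ccl_annotations` is a hypothetical name, and the sample is a simplified document: the annotation on "Toruniu" is invented for illustration, because the output above is cut off before that token.

```python
import xml.etree.ElementTree as ET

# Simplified sample; the "Toruniu" annotation is an assumption for illustration
SAMPLE_CCL = """<chunkList>
 <chunk type="p" id="ch1">
  <sentence id="s1">
   <tok><orth>Tony</orth>
    <ann chan="nam_liv">1</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>Halik</orth>
    <ann chan="nam_liv">1</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>w</orth>
    <ann chan="nam_liv">0</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>Toruniu</orth>
    <ann chan="nam_liv">0</ann><ann chan="nam_loc">1</ann></tok>
  </sentence>
 </chunk>
</chunkList>"""


def ccl_annotations(ccl):
    """Return (channel, text) pairs for every annotation in the document."""
    root = ET.fromstring(ccl)
    found = []
    for sentence in root.iter('sentence'):
        spans = {}  # (channel, number) -> orths; dicts keep insertion order
        for tok in sentence.iter('tok'):
            orth = tok.findtext('orth', default='')
            for ann in tok.iter('ann'):
                number = int(ann.text)
                if number > 0:
                    spans.setdefault((ann.get('chan'), number), []).append(orth)
        found.extend((chan, " ".join(orths)) for (chan, _), orths in spans.items())
    return found


print(ccl_annotations(SAMPLE_CCL))
```

The same function works for the single `nam` channel produced by the boundary-only model.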

## 3. Batch processing

CLARIN-PL WS can process a set of files uploaded as a zip package.
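
If the documents are not packaged yet, a zip archive can be built in memory before uploading. This is a minimal sketch; `make_zip` and the file names are hypothetical, and it assumes the service accepts UTF-8 plain-text files inside the archive.

```python
import io
import zipfile


def make_zip(documents):
    """Pack a {file name: text} mapping into zip bytes held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in documents.items():
            # Assumption: the service expects UTF-8 encoded text files
            zf.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


package = make_zip({
    "doc1.txt": "Na płocie siedzi kot.",
    "doc2.txt": "Tony Halik przyszedł na świat w Toruniu",
})
print("Size of the package: %d" % len(package))
```

The returned bytes can take the place of a downloaded package as the body of the upload request.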

### Get a zip package with documents to process

In [11]:
import urllib.request

url_zip = "https://www.dropbox.com/s/54gmpdd6x3rx4gq/brexit_pl.zip?dl=1"

doc = urllib.request.urlopen(url_zip).read()
    
print("Size of the package: %d" % len(doc))

Size of the package: 800523


### Upload the package to CLARIN-PL WS

In [12]:
url = clarinpl_url + "/upload/"

headers = {'content-type': 'binary/octet-stream'}

file_handler = requests.post(url, data=doc, headers=headers).text
print("File handler: %s" % file_handler)
print("URL: %s/download%s" % (clarinpl_url, file_handler))

File handler: /users/default/d291fc62-d9b7-4b41-922b-312a20a92a52
URL: http://ws.clarin-pl.eu/nlprest2/base/download/users/default/d291fc62-d9b7-4b41-922b-312a20a92a52


### Process the package

In [13]:
import time

url = clarinpl_url + "/startTask"
lpmn = 'filezip(%s)|wcrft2|dir|makezip' % file_handler
print("LPMN: %s" % lpmn)

payload = {'lpmn': lpmn, 'user': user_mail}
headers = {'content-type': 'application/json'}

start = time.time()
task_id = requests.post(url, data=json.dumps(payload), headers=headers).text
print("Task id: %s" % task_id)

# Poll the task status once per second until the task finishes or fails
processing = True
file_id = None

while processing:
    data = requests.get(clarinpl_url + "/getStatus/" + task_id).text
    result = json.loads(data)
    end = time.time()
    if result["status"] == "PROCESSING":
        # While the task is running, "value" holds the progress as a fraction
        print("[%3d s] Status: %s; Progress: %6.2f%%" % (end-start, result["status"], result["value"]*100))
        time.sleep(1)
    elif result["status"] == "DONE":
        # When the task is done, "value" lists the result files
        file_id = result["value"][0]["fileID"]
        processing = False
        print("[%3d s] Status: DONE      ; Progress: 100.00%%" % (end-start))
    else:
        # Any other status is treated as a failure: print the raw response
        print(data)
        processing = False

print("Result file id: %s" % file_id)

LPMN: filezip(/users/default/d291fc62-d9b7-4b41-922b-312a20a92a52)|wcrft2|dir|makezip
Task id: 14916758-a063-433d-a500-a1f3e8e3045f
[  0 s] Status: PROCESSING; Progress:   0.00%
[  1 s] Status: PROCESSING; Progress:   0.00%
[  2 s] Status: PROCESSING; Progress:   8.78%
[  4 s] Status: PROCESSING; Progress:  36.73%
[  5 s] Status: PROCESSING; Progress:  57.09%
[  6 s] Status: PROCESSING; Progress:  76.05%
[  7 s] Status: PROCESSING; Progress:  95.61%
[  8 s] Status: PROCESSING; Progress:  95.61%
[  9 s] Status: PROCESSING; Progress:  95.61%
[ 11 s] Status: DONE      ; Progress: 100.00%
Result file id: /requests/makezip/243e749d-e3a5-4162-a071-44d35a9514f0
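
The polling loop above runs until the service reports something other than `PROCESSING`. The sketch below wraps the same logic in a function with a timeout. `wait_for_task` and its parameters are hypothetical names; the status fetcher is passed in as a callable so the example runs offline with canned responses, and only the two statuses seen above (`PROCESSING`, `DONE`) are assumed.

```python
import time


def wait_for_task(get_status, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll get_status() until the task is DONE and return the result file id.

    get_status is a callable returning the decoded status dictionary.
    Raises RuntimeError on an unexpected status, TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = get_status()
        status = result["status"]
        if status == "DONE":
            return result["value"][0]["fileID"]
        if status != "PROCESSING":
            # Mirrors the loop above: anything else counts as a failure
            raise RuntimeError("Task failed: %r" % (result,))
        if time.monotonic() >= deadline:
            raise TimeoutError("Task did not finish in %.0f s" % timeout)
        sleep(poll_interval)


# Canned responses standing in for the service (illustrative values only)
responses = iter([
    {"status": "PROCESSING", "value": 0.5},
    {"status": "DONE", "value": [{"fileID": "/requests/makezip/example"}]},
])
file_id = wait_for_task(lambda: next(responses), sleep=lambda seconds: None)
print("Result file id: %s" % file_id)
```

Against the real service, `get_status` could be `lambda: requests.get(clarinpl_url + "/getStatus/" + task_id).json()`.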


### Download the result

In [14]:
path = "result.zip"

url = clarinpl_url + "/download" + file_id
print(url)
data = requests.get(url).content
# The context manager closes the file even if the write fails
with open(path, "wb") as f:
    f.write(data)

print("Saved to %s" % path)

http://ws.clarin-pl.eu/nlprest2/base/download/requests/makezip/243e749d-e3a5-4162-a071-44d35a9514f0
Saved to result.zip


### Browse the result

In [15]:
import zipfile

zf = zipfile.ZipFile(path, 'r')

print("Number of documents: %d" % len(zf.namelist()))

print("")
print("First 10 files in the package:")
print(zf.namelist()[:10])

print("")
print("Content of the first file:")
data = zf.read(zf.namelist()[0]).decode("utf-8-sig")
print(data)

Number of documents: 500

First 10 files in the package:
['brexit_pl%brexit_pl.txt_file_700.txt', 'brexit_pl%brexit_pl.txt_file_274.txt', 'brexit_pl%brexit_pl.txt_file_1337.txt', 'brexit_pl%brexit_pl.txt_file_1918.txt', 'brexit_pl%brexit_pl.txt_file_1302.txt', 'brexit_pl%brexit_pl.txt_file_1934.txt', 'brexit_pl%brexit_pl.txt_file_1683.txt', 'brexit_pl%brexit_pl.txt_file_1441.txt', 'brexit_pl%brexit_pl.txt_file_626.txt', 'brexit_pl%brexit_pl.txt_file_1233.txt']

Content of the first file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>pl</orth>
    <lex disamb="1"><base>Polska</base><ctag>brev:npun</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>-</orth>
    <lex disamb="1"><base>-</base><ctag>interp</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>700</orth>
    <lex disamb="1"><base>700</base><ctag>num:pl:nom:m1:rec</ctag></lex>
   </tok>
   <tok>
    <orth>pl</orth>
    <
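
Once the result package is available, all documents in it can be processed in one loop. The sketch below counts part-of-speech tags across every file of a zip archive. `pos_counts` is a hypothetical name, and the demo builds a tiny in-memory archive with invented file names so that it runs without the service; with the real result, the path `result.zip` could be passed instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def pos_counts(zip_file):
    """Count part-of-speech tags over all CCL documents in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            # utf-8-sig drops a leading byte order mark, as in the cell above
            root = ET.fromstring(zf.read(name).decode("utf-8-sig"))
            for ctag in root.iter('ctag'):
                counts[ctag.text.split(":")[0]] += 1
    return counts


# Tiny stand-in for result.zip (file names and contents are illustrative)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "<chunkList><tok><lex><ctag>subst:sg:nom:m1</ctag></lex></tok>"
                         "<tok><lex><ctag>fin:sg:ter:imperf</ctag></lex></tok></chunkList>")
    zf.writestr("b.txt", "<chunkList><tok><lex><ctag>subst:sg:loc:m3</ctag></lex></tok>"
                         "<tok><lex><ctag>interp</ctag></lex></tok></chunkList>")

counts = pos_counts(io.BytesIO(buffer.getvalue()))
print(counts.most_common())
```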

[Back to agenda](agenda.ipynb)