# Nafigator Tutorial
This tutorial helps you get started with the Nafigator package. We will:

1. Create a NAF file
2. Retrieve information from NAF files
3. Store your NAF files

This tutorial is set up for one PDF file. You can also import multiple files at the same time.
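
When working with several files, one possible approach is to loop over the PDFs and derive an output name for each of them. The sketch below only covers that naming step. `naf_name_for` is a hypothetical helper, not part of nafigator, and the file names are made up for illustration.

```python
import os

def naf_name_for(pdf_name):
    """Return the .naf.xml file name that belongs to a PDF file name."""
    # Hypothetical helper: replace the extension with '.naf.xml'
    base, _ = os.path.splitext(pdf_name)
    return base + ".naf.xml"

# Made-up file names, for illustration only
pdf_files = ["report_2019.pdf", "report_2020.pdf"]
naf_files = [naf_name_for(name) for name in pdf_files]
print(naf_files)
```

Each PDF can then be processed in the same way as the single file used in the rest of this tutorial.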

## Getting ready

In [None]:
# Import the packages
import nafigator
import os
import stanza
import pandas as pd
from nafigator.parse2naf import generate_naf

In [None]:
# Download and specify your (English) NLP engine
stanza.download('en')
stanza_nlp = stanza.Pipeline('en')

# 1. Create a NAF file

Depending on the length of your document, the creation of a NAF file containing all the relevant layers may take a while.
In this example we use "DNB's annual report 2020".

In [None]:
# Generate the NAF file from an input (PDF) file. The data folder contains an example for you to use,
# but you can specify the file you want to analyse by changing file_name.
file_name = "../data/external/naf/annual_report_dnb_2020.pdf"
naf_name = file_name[:-3] + "naf.xml"
if not os.path.exists(naf_name):
    # No stored NAF file yet: parse the PDF (this may take a while)
    doc = generate_naf(input=file_name,
                       engine="stanza",
                       language="en",
                       naf_version="v3.1",
                       dtd_validation=False,
                       params={'fileDesc': {'author': 'anonymous'}},
                       nlp=stanza_nlp)
else:
    # A NAF file was stored earlier: open it instead of parsing again
    doc = nafigator.NafDocument().open(naf_name)

Congratulations! You now have your first NAF file. The plain text of the document is available through the raw layer, which is covered in the next section.

# 2. Retrieving information from NAF documents

When working with NAF, it's important to understand the structure of the naf.xml file.

### The raw layer

The raw layer contains the complete string of the document without annotations.

In [None]:
doc.raw[0:122]

### The header layer

The header layer contains all metadata of the NAF file: file description, public information and information about the linguistic processors used.

In [None]:
# To get information about the layers in the NAF file, use the header attribute:
doc.header.keys()

In [None]:
doc.header['fileDesc']

Public data definitions follow the Dublin Core Metadata Initiative: http://purl.org/dc/elements/1.1/

In [None]:
doc.header['public']

Documents are parsed with an NLP engine (in this case stanza) consisting of different pipeline elements.

In [None]:
doc.header['linguisticProcessors']

As you can see, each layer is a different dictionary containing a list.
For more examples, please check: https://pypi.org/project/nafigator/

### The text layer

You can use doc.text to access the following elements from the text layer:
- id: the id of the word form
- sent: sentence id of the word form
- para: paragraph id of the word form
- offset: the offset (in characters) of the word form
- length: the length (in characters) of the word form

In [None]:
# To access a specific part of the document, 
# you can use lists to go through the entire file. 
# Let's say you want to access the 4240th word in this file.
doc.text[4239]

In [None]:
# If you want to extract specific values from the file (for example, the page number where the word occurs)
doc.text[4239]["page"]

In [None]:
# The offset is aligned with the raw layer, so slicing the raw text
# from the offset over the length of the word form returns the same word:
doc.raw[24016:24016+7]
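
The relation between offset, length and the raw text can be sketched without hard-coding the numbers. The example below uses a made-up miniature document rather than the annual report, and `slice_raw` is a hypothetical helper. It assumes that offset and length may be stored as strings, so both are converted with `int()`.

```python
# Made-up miniature example; a real document has a much longer raw string
raw = "The economy grew."
wordform = {"id": "w2", "text": "economy", "offset": "4", "length": "7"}

def slice_raw(raw_text, wf):
    """Return the part of the raw text that a word form points to."""
    # Assumption: offset and length may be strings, so convert them first
    start = int(wf["offset"])
    end = start + int(wf["length"])
    return raw_text[start:end]

print(slice_raw(raw, wordform))
```

With a real document, the same idea would apply to `doc.raw` and an element of `doc.text`.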

### The terms layer

The terms layer contains linguistic and morphological properties of each word.
- id: the id of the term
- type: open or closed term
- lemma: the lemma of the term
- pos: the part-of-speech of the term
- morphofeat: the morphological features of the term
- span: the ids of the word forms of this term

In [None]:
doc.terms[4239]

In this case the position of the term in the list is identical to the position of its word form, but in general this is not the case. Retrieve the corresponding word forms through the span of the term.
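
A minimal sketch of such a span lookup is given below. It uses made-up data instead of `doc.text` and `doc.terms`, and it assumes that a span is a list of dictionaries with an `id` key, which is the shape used in the lemmatized search later in this tutorial. `term_text` is a hypothetical helper.

```python
# Made-up miniature layers in the same shape as the text and terms layers
wordforms = [
    {"id": "w1", "text": "Central"},
    {"id": "w2", "text": "banks"},
    {"id": "w3", "text": "act"},
]
terms = [
    {"id": "t1", "lemma": "central", "span": [{"id": "w1"}]},
    {"id": "t2", "lemma": "bank", "span": [{"id": "w2"}]},
    {"id": "t3", "lemma": "act", "span": [{"id": "w3"}]},
]

# Index the word forms by id so the lookup does not depend on list positions
wordform_by_id = {wf["id"]: wf for wf in wordforms}

def term_text(term):
    """Join the texts of all word forms in the span of a term."""
    return " ".join(wordform_by_id[ref["id"]]["text"] for ref in term["span"])

print(term_text(terms[1]))
```

Looking up by id rather than by list position keeps the result correct when terms and word forms are not aligned one to one.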

### The entities layer

In [None]:
# The different types of entities that are found
print({entity['type'] for entity in doc.entities})

In [None]:
# The organisations among the first 100 entities that are recognized by the NLP engine
print([entity['text'] for entity in doc.entities[0:100] if entity['type']=='ORG'])

In [None]:
# Standard NLP engines make errors in recognizing entities:
print([entity['text'] for entity in doc.entities[0:100] if entity['type']=='WORK_OF_ART'])
# (although some might disagree in this case :) )
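
To get a feeling for how often each entity type occurs, the types can be counted. The sketch below does this with `collections.Counter` on made-up entities. With a real document the same expression could be applied to `doc.entities`, assuming each entity has a `type` key as in the cells above.

```python
from collections import Counter

# Made-up entities, for illustration only
entities = [
    {"type": "ORG", "text": "DNB"},
    {"type": "DATE", "text": "2020"},
    {"type": "ORG", "text": "ECB"},
]

# Count how many entities were found per type
type_counts = Counter(entity["type"] for entity in entities)
print(type_counts.most_common())
```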

### The dependency layer

## Keyword search
You can search NAF documents in several ways. Here we'll show you two types:
- Exact search
- Lemmatized search

### Exact Search

In [None]:
# The 4240th word (index 4239) is 'economy'. To find all occurrences of this word, loop through the
# text layer as shown below. This prints the ids of all word forms whose text is exactly 'economy'.
print([word["id"] for word in doc.text if word["text"]=="economy"])

### Lemmatized Search


In [None]:
# Gather the first word id of all terms whose lemma is 'economy'
print([term['span'][0]['id'] for term in doc.terms if term['lemma']=='economy'])

In [None]:
# retrieve one of the word ids that did not come up with the exact search
print([word['text'] for word in doc.text if word["id"]=="w985"])
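
The difference between the two searches can be made explicit with set operations. The sketch below uses made-up data in the same shape as the text and terms layers. The variable names are chosen for illustration only.

```python
# Made-up miniature layers, for illustration only
wordforms = [
    {"id": "w1", "text": "economy"},
    {"id": "w2", "text": "economies"},
    {"id": "w3", "text": "growth"},
]
terms = [
    {"lemma": "economy", "span": [{"id": "w1"}]},
    {"lemma": "economy", "span": [{"id": "w2"}]},
    {"lemma": "growth", "span": [{"id": "w3"}]},
]

# Word ids found by the exact search and by the lemmatized search
exact_ids = {wf["id"] for wf in wordforms if wf["text"] == "economy"}
lemma_ids = {term["span"][0]["id"] for term in terms if term["lemma"] == "economy"}

# Word ids that only the lemmatized search finds
only_by_lemma = sorted(lemma_ids - exact_ids)
print(only_by_lemma)
```

In this made-up example the plural form is only found through its lemma.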

### Getting more from page text

In [None]:
# Print the text on the 22nd page
print(" ".join([word['text'] for word in doc.text if word['page'] == '22']))
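
If the text of many pages is needed, filtering the whole text layer once per page becomes repetitive. One alternative, sketched below on made-up data, is to group the words by page in a single pass. It assumes that each word form has a `page` key holding a string, as in the cell above.

```python
from collections import defaultdict

# Made-up word forms, for illustration only
wordforms = [
    {"text": "Annual", "page": "1"},
    {"text": "report", "page": "1"},
    {"text": "Outlook", "page": "2"},
]

# Collect the words of each page in one pass over the text layer
words_per_page = defaultdict(list)
for wf in wordforms:
    words_per_page[wf["page"]].append(wf["text"])

# Join the words of each page into one string
page_text = {page: " ".join(words) for page, words in words_per_page.items()}
print(page_text["1"])
```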

### Printing Specific Sentences

In [None]:
# Print a sentence based on its position in the list of sentences (index 23 is the 24th sentence)
sentence = doc.sentences[23]
print("Sentence: " + str(sentence["text"]) + "\n")

# 3. Storing your NAF file
After you have generated your NAF file, you probably want to store it for later use, especially since generating it is the most time-consuming part of your analysis.

In [None]:
# Store your document as a NAF XML file
doc.write("../data/external/naf/annual_report_dnb_2020.naf.xml")

In [None]:
# If you want to reuse it later, open the existing NAF file as shown below
doc_name = os.path.join("..", "data", "external", "naf", "annual_report_dnb_2020.naf.xml")
doc = nafigator.NafDocument().open(doc_name)