# Nafigator Tutorial
This tutorial helps you get started with the Nafigator package. We will:
1. Create a catalog file: this contains the metadata of your file.
2. Create a NAF file: this contains all the content of your text file.
3. Access the text layer to understand the structure of NAF and do some basic search queries.
4. Store your NAF files.

This tutorial is set up for one PDF file. You can also import multiple files at the same time.

## Getting ready

In [1]:
#import your packages
import nafigator
import os
import stanza
import pandas as pd
from nafigator.parse2naf import generate_naf

In [2]:
#Download and specify your (English) NLP engine
stanza.download('en')
stanza_nlp = stanza.Pipeline('en')

2022-02-14 08:11:14 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-02-14 08:11:14 INFO: Use device: cpu
2022-02-14 08:11:14 INFO: Loading: tokenize
2022-02-14 08:11:14 INFO: Loading: pos
2022-02-14 08:11:15 INFO: Loading: lemma
2022-02-14 08:11:15 INFO: Loading: depparse
2022-02-14 08:11:15 INFO: Loading: sentiment
2022-02-14 08:11:15 INFO: Loading: constituency
2022-02-14 08:11:16 INFO: Loading: ner
2022-02-14 08:11:16 INFO: Done loading processors!


## 1. Create a catalog file
Catalog files contain the metadata of your files. Here you can store the correct information about each document.
We use the [Dublin Core Metadata Initiative](https://www.dublincore.org/resources/userguide/publishing_metadata/#exCon1) terms for the metadata fields.

In [None]:
#create a catalog file containing your metadata
df_catalog=pd.DataFrame(columns =['dc:identifier', 'dc:source', 'dc:relation', 'dc:creator', 'dc:format', 'dc:language', 'dc:type', 'dc:coverage', 'naf:status'])

In [None]:
doc_name = os.path.join("..", "data", "external", "naf", "annual_report_dnb_2020.naf.xml")

In [None]:
#add the metadata of the document to the dataframe
df_catalog=nafigator.doc2catalog(doc_name, df_catalog, document_source = None, document_type="annual report", document_year= 2020)

## 2. Create a naf file
Depending on the length of your document, the creation of a NAF file containing all the relevant layers may take a while.
In this example we use DNB's 2020 annual report.

In [None]:
#generate the NAF file with an input (pdf) file. The data folder contains an example for you to use, but you can specify the file you want to analyse by changing 'input'.
doc = generate_naf(input = "../data/external/naf/annual_report_dnb_2020.pdf",
                   engine = "stanza",
                   language = "en",
                   naf_version = "v3.1",
                   dtd_validation = False,
                   params = {'fileDesc': {'author': 'anonymous'}},
                   nlp = None)
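
The introduction mentions that multiple files can be processed as well. A minimal sketch of one way to prepare such a batch is shown below. The helpers `naf_path_for` and `plan_batch` are hypothetical names introduced here, not part of nafigator, and the sketch assumes you want each NAF file written next to its PDF with a `.naf.xml` extension. The nafigator calls are left as comments so the block runs on its own.

```python
from pathlib import Path


def naf_path_for(pdf_path):
    """Return the output path next to the PDF, ending in '.naf.xml'."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + ".naf.xml")


def plan_batch(folder):
    """Pair every PDF in `folder` with the NAF path it would be written to."""
    return [(pdf, naf_path_for(pdf)) for pdf in sorted(Path(folder).glob("*.pdf"))]


# In the notebook, each pair could then be processed with the calls used in
# this tutorial (kept as comments so this sketch runs without nafigator):
# for pdf, naf in plan_batch("../data/external/naf"):
#     doc = generate_naf(input=str(pdf), engine="stanza", language="en",
#                        naf_version="v3.1", dtd_validation=False,
#                        params={'fileDesc': {'author': 'anonymous'}}, nlp=None)
#     doc.write(str(naf))

print(naf_path_for("../data/external/naf/annual_report_dnb_2020.pdf"))
```

Planning the input and output paths first makes it easy to check the list before starting the slow NAF generation.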

## 3. Access the text layer to understand the structure of NAF and perform some basic search queries

In [None]:
#Congratulations! You now have your first NAF file. To access the plain text of the document, run the command below.
doc.raw

When working with NAF, it's important to understand the structure of the naf.xml file.

In [None]:
#To get information about the layers in the NAF file, use the header attribute:
doc.header

As you can see, each layer is a different dictionary containing a list.
For more examples, please check: https://pypi.org/project/nafigator/

### The structure - how to use the text layer
You can use doc.text to access the following elements from the text layer:
- text: the word form itself
- id: the id of the word form
- sent: sentence id of the word form
- para: paragraph id of the word form
- page: page number of the word form
- offset: the offset (in characters) of the word form
- length: the length (in characters) of the word form
In [None]:
#Printing this result shows the structure of the text layer: a list of dictionaries, one per word.
doc.text

In [6]:
#doc.text is a list, so you can index it to reach a specific word. Indexing starts at 0, so the 4240th word in this file is at index 4239.
doc.text[4239]

{'text': 'economy',
 'id': 'w4240',
 'sent': '165',
 'para': '41',
 'page': '11',
 'offset': '24016',
 'length': '7'}

In [None]:
#If you want to extract a specific value from the file (for example, the page number where the word occurs), select it by key
doc.text[4239]["page"]

### Search
NAF makes multiple types of search possible. Here we'll show you two types:
- Exact search
- Lemmatized search

### Exact Search

In [None]:
#The 4240th word is 'economy'. To find every sentence that contains this word, loop over the text layer as shown below. This returns the sentence id of each occurrence of 'economy'.
[word["sent"] for word in doc.text if word["text"] == "economy"]

In [11]:
#The same pattern returns the page of each occurrence of 'table'
[word["page"] for word in doc.text if word["text"] == "table"]

['3',
 '110',
 '130',
 '131',
 '131',
 '132',
 '132',
 '133',
 '140',
 '141',
 '142',
 '142',
 '144',
 '148',
 '149',
 '151',
 '154',
 '156',
 '157',
 '157',
 '158',
 '159',
 '163',
 '165',
 '165',
 '165',
 '166',
 '167',
 '167',
 '172',
 '174',
 '176',
 '177',
 '177']
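
The result above lists a page once per occurrence and keeps page numbers as strings. A small sketch of how the hits could be de-duplicated, sorted numerically, and counted is given below. The `words` list is a hypothetical stand-in for `doc.text`, and `pages_with` and `hits_per_page` are illustrative helpers, not nafigator functions. Note that this sketch compares case-insensitively, which is a deliberate difference from the exact match used above.

```python
from collections import Counter

# Hypothetical stand-in for doc.text; only the fields used here are included.
words = [
    {"text": "Table", "page": "3"},
    {"text": "table", "page": "131"},
    {"text": "table", "page": "131"},
    {"text": "economy", "page": "11"},
    {"text": "table", "page": "110"},
]


def pages_with(words, target):
    """Sorted, de-duplicated page numbers on which `target` occurs."""
    target = target.lower()
    return sorted({int(w["page"]) for w in words if w["text"].lower() == target})


def hits_per_page(words, target):
    """Number of occurrences of `target` on each page."""
    target = target.lower()
    return Counter(int(w["page"]) for w in words if w["text"].lower() == target)


print(pages_with(words, "table"))
print(hits_per_page(words, "table"))
```

Converting pages to `int` before sorting avoids the string ordering in which `'110'` would sort before `'3'`.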

### Lemmatized Search


In [None]:
#Gather all the lemmas
lemmas = [term['lemma'] for term in doc.terms]
#Search for combined terms
economy=nafigator.sublist_indices(["economy"], lemmas)
print(economy)
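
To make the idea of searching a list of lemmas for a (possibly multi-word) term more concrete, here is a minimal pure-Python sketch. The helper `find_sublist` is hypothetical and written for illustration only; it is not nafigator's implementation, and the return format of `nafigator.sublist_indices` may differ. The `lemmas` list is made-up sample data.

```python
def find_sublist(sub, full):
    """Return the start index of every occurrence of `sub` inside `full`."""
    n = len(sub)
    if n == 0:
        return []
    return [i for i in range(len(full) - n + 1) if full[i:i + n] == sub]


# Made-up lemma sequence standing in for [term['lemma'] for term in doc.terms].
lemmas = ["low", "interest", "rate", "put", "net", "interest",
          "income", "under", "pressure", "low", "interest", "rate"]

print(find_sublist(["interest", "rate"], lemmas))
print(find_sublist(["low", "interest", "rate"], lemmas))
```

Because the search runs on lemmas, a query for `rate` would also match word forms such as "rates" in the original text.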

### Looking up terms and words by id

In [None]:
#For a lemmatized search it helps to look up terms and words by their id
#First gather the terms and words in dictionaries keyed by id
doc_terms = {term['id']: term for term in doc.terms}
doc_words = {word['id']: word for word in doc.text}


### Getting more from page text

In [7]:
# Print the text on the 22nd page
[word['text'] for word in doc.text if word['page'] == '22']

['21',
 'Globally',
 ',',
 'interest',
 'rates',
 'have',
 'been',
 'low',
 'for',
 'some',
 'time',
 'now',
 'due',
 'to',
 'a',
 'number',
 'of',
 'long-term',
 'trends',
 ',',
 'which',
 'have',
 'been',
 'reinforced',
 'by',
 'the',
 'coronavirus',
 'crisis',
 '.',
 'For',
 'banks',
 ',',
 'long-',
 'term',
 'low',
 'interest',
 'rates',
 'put',
 'net',
 'interest',
 'income',
 'under',
 'pressure',
 ',',
 'as',
 'these',
 'are',
 'only',
 'passed',
 'on',
 'to',
 'deposit',
 'rates',
 'to',
 'a',
 'limited',
 'extent',
 '.',
 'Pension',
 'funds',
 'and',
 'insurers',
 'are',
 'also',
 'being',
 'affected',
 'by',
 'the',
 'increased',
 'costs',
 'of',
 'building',
 'up',
 'fully',
 'funded',
 'pensions',
 'and',
 'life',
 'insurance',
 'policies',
 '.',
 'It',
 'is',
 'important',
 'that',
 'financial',
 'institutions',
 'and',
 'society',
 'continue',
 'to',
 'adapt',
 'to',
 'the',
 'low',
 'interest',
 'rate',
 'environment',
 '.',
 'Low',
 'interest',
 'rates',
 'due',
 'to',
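
A list of separate tokens is hard to read. One simple way to turn it back into running text is sketched below. The helper `join_tokens` and the `NO_SPACE_BEFORE` set are illustrative choices, not part of nafigator, and the rule (no space before common closing punctuation) is a rough heuristic rather than a full detokenizer.

```python
# Punctuation that should attach to the preceding token (a rough heuristic).
NO_SPACE_BEFORE = {",", ".", ";", ":", "!", "?", ")", "%"}


def join_tokens(tokens):
    """Join tokens with spaces, except before closing punctuation."""
    text = ""
    for token in tokens:
        if text and token not in NO_SPACE_BEFORE:
            text += " "
        text += token
    return text


# A few tokens taken from the page output above.
tokens = ["Globally", ",", "interest", "rates", "have", "been", "low", "."]
print(join_tokens(tokens))
```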


### Printing Specific Sentences

In [8]:
#print a sentence by its position in doc.sentences (index 23 is the 24th sentence)
sentence = doc.sentences[23]
print("Sentence: " + str(sentence["text"]) + "\n")

Sentence: Nevertheless , we face significant economic uncertainty in the short term .
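
Sentences can also be rebuilt from the text layer by grouping words on their `sent` id, for example to list the sentences that occur on one page. The sketch below uses a made-up `words` list as a stand-in for `doc.text`, and `sentences_on_page` is a hypothetical helper. Tokens are joined with single spaces, which matches the spacing of the sentence printed above.

```python
from collections import defaultdict

# Made-up stand-in for doc.text with only the fields used here.
words = [
    {"text": "Nevertheless", "sent": "24", "page": "5"},
    {"text": ",", "sent": "24", "page": "5"},
    {"text": "we", "sent": "24", "page": "5"},
    {"text": "face", "sent": "24", "page": "5"},
    {"text": "uncertainty", "sent": "24", "page": "5"},
    {"text": ".", "sent": "24", "page": "5"},
    {"text": "Rates", "sent": "25", "page": "5"},
    {"text": "are", "sent": "25", "page": "5"},
    {"text": "low", "sent": "25", "page": "5"},
    {"text": ".", "sent": "25", "page": "5"},
    {"text": "Outlook", "sent": "26", "page": "6"},
]


def sentences_on_page(words, page):
    """Map each sentence id on `page` to its space-joined tokens."""
    grouped = defaultdict(list)
    for word in words:
        if word["page"] == page:
            grouped[word["sent"]].append(word["text"])
    return {sent: " ".join(tokens) for sent, tokens in grouped.items()}


for sent_id, text in sentences_on_page(words, "5").items():
    print(sent_id, text)
```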



### Get Insight into the Entities
NAF stores information about the text in different layers. For example, you might want to list all the entities that are named in this document.

In [4]:
# To get all the entities in the text, use the command below
entities = doc.entities

# If you only want the organisations, filter on the entity type
[entity['text'] for entity in entities if entity['type'] == "ORG"]


['Nederlandsche Bank',
 'DNB',
 'De Nederlandsche Bank N.V.',
 '5 Governing Board',
 'Supervisory Board',
 'Bank Council',
 'Employees Council',
 'the European Central Bank',
 'ECB',
 'EU',
 'ECB',
 'Nederlandsche Bank 6',
 'the Financial Stability Board',
 'FSB',
 'Commission',
 'ECB',
 'ECB',
 'ECB',
 'Nederlandsche Bank',
 'the European Council',
 'COVID',
 'IMF',
 'the World Bank',
 'ECB',
 'ECB',
 'ECB',
 'the Fiscal Space Working Group',
 'ECB',
 'the National Forum on the Payment System',
 'NFPS',
 'DNB',
 'Nederlandsche Bank',
 'ECB',
 'NOW',
 'Next Generation EU',
 'Nederlandsche Bank',
 'DNB',
 'ECB',
 'the Pandemic Emergency Purchase Programme',
 'PEPP',
 'TLTROs',
 'TLTROs',
 'PEPP',
 'ECB',
 'ECB',
 'DNB',
 'ECB',
 'Nederlandsche Bank',
 'ECB',
 'DNB',
 'CPB Netherlands Bureau for Economic Policy Analysis',
 'Nederlandsche Bank',
 'DNB',
 'DNB',
 'ECB',
 'ECB',
 'DNB',
 'ECB',
 'ECB',
 'DNB',
 'ECB',
 'ECB',
 'IMF',
 'G7',
 'ReportDe Nederlandsche Bank 22',
 'ECB',
 'ECB',
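
Because many organisations are mentioned repeatedly, a frequency count is often more useful than the raw list. A minimal sketch using `collections.Counter` follows. The `entities` list is made-up sample data standing in for `doc.entities`, with only the `text` and `type` keys used in this tutorial.

```python
from collections import Counter

# Made-up stand-in for doc.entities.
entities = [
    {"text": "ECB", "type": "ORG"},
    {"text": "DNB", "type": "ORG"},
    {"text": "ECB", "type": "ORG"},
    {"text": "IMF", "type": "ORG"},
    {"text": "2020", "type": "DATE"},
    {"text": "DNB", "type": "ORG"},
    {"text": "ECB", "type": "ORG"},
]

# Count how often each organisation is mentioned.
org_counts = Counter(e["text"] for e in entities if e["type"] == "ORG")

for name, count in org_counts.most_common(2):
    print(name, count)
```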

## 4. Storing your NAF File
After you have generated your NAF file, you probably want to store it for later use, especially since generating it is the most time-consuming part of your analysis.

In [None]:
#store your NAF document as XML
doc.write("../data/external/naf/annual_report_dnb_2020.naf.xml")

In [3]:
#if you want to reuse it later, import an existing naf file as shown below
doc_name = os.path.join("..", "data", "external", "naf", "annual_report_dnb_2020.naf.xml")
#doc = nafigator.NafDocument().open('notebook_data/output.naf.xml')
doc=nafigator.NafDocument().open(doc_name)
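
Since generating the NAF file is slow, it can help to generate it only when no stored copy exists yet. The sketch below shows that pattern with a hypothetical helper `load_or_build`; it is not part of nafigator. To keep the block runnable on its own, the demo uses plain-text stand-ins for building, saving, and loading, and the comment shows how the nafigator calls from this tutorial could be plugged in.

```python
import os
import tempfile


def load_or_build(path, build, load, save):
    """Load the document from `path` if it exists, else build and save it."""
    if os.path.exists(path):
        return load(path)
    document = build()
    save(document, path)
    return document


# With nafigator, the wiring could look like this (kept as a comment):
# doc = load_or_build(
#     doc_name,
#     build=lambda: generate_naf(input="../data/external/naf/annual_report_dnb_2020.pdf",
#                                engine="stanza", language="en", naf_version="v3.1",
#                                dtd_validation=False,
#                                params={'fileDesc': {'author': 'anonymous'}}, nlp=None),
#     load=lambda p: nafigator.NafDocument().open(p),
#     save=lambda d, p: d.write(p),
# )

# Plain-text stand-ins so the sketch runs without nafigator.
def _save_text(text, path):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def _load_text(path):
    with open(path, encoding="utf-8") as handle:
        return handle.read()


calls = []


def _build():
    calls.append(1)  # record each (expensive) build
    return "expensive result"


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "example.txt")
    first = load_or_build(target, _build, _load_text, _save_text)
    second = load_or_build(target, _build, _load_text, _save_text)

print(first == second, len(calls))
```

The second call finds the stored file and skips the build step, which is the behaviour you want when rerunning the notebook.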