# Nedextract use case example

In this tutorial we will analyse a pdf file using all the different options that nedextract offers. As a test case we will be looking at the fictive annual report './Data/Jaarverslag_Bedrijf.pdf'.
The file fulfills the following requirements:

- the file is machine readable (does not have to be analysed using OCR)
- the file is written in Dutch

In [1]:
# Load the requirements
%matplotlib inline
import sys
sys.path.append('../')
from os.path import exists
from nedextract import run_nedextract
from nedextract import classify_organisation



In [2]:
# Use the example data file or replace the file name with your own
file = 'Data/Jaarverslag_Bedrijf.pdf'

In [3]:
# Check if the file exists
exists(file)

True

In [4]:
# Check what the run function does
help(run_nedextract.run)

Help on function run in module nedextract.run_nedextract:

run(directory=None, file=None, url=None, urlf=None, tasks=['people'], anbis=None, model=None, labels=None, vectors=None, write_output=False)
    Annual report information extraction.
    
    This function runs the full nedextract pipleline. The pipeline is originally designed to read 
    Dutch annual report pdf files from non-profit organisation and extract relevant information.
    It can extract information about people (task 'people') engaged in the organisation and the position they hold,
    entities named in the file (task 'orgs'), and/or identify the sector in which the organisation is active
    based on a provided pretrained model.
    
    
    The following general steps are taken:
    - Read in the pdf files(s) and preprocess the text.
    - (optional) Extract mentioned people from the text and identify their position within the organisation.
      Which of the people named in the text can be found to likely hold 

In [5]:
# let's extract information on people and organisations metioned in the file.
tasks = ['people', 'orgs']

# only obtain the first and third dataframe
df_p, _, df_o = run_nedextract.run(file=file, tasks=tasks)

2023-08-23 17:31:21 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-08-23 17:31:21 Working on file: /Users/lootes/Documenten/Projects/Transparency_in_the_Dutch_non-profit_sector/np-transparency/Tutorials/Data/Jaarverslag_Bedrijf.pdf


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

2023-08-23 17:31:21 INFO: Loading these models for language: nl (Dutch):
| Processor | Package |
-----------------------
| tokenize  | alpino  |
| ner       | conll02 |

2023-08-23 17:31:21 INFO: Using device: cpu
2023-08-23 17:31:21 INFO: Loading: tokenize
2023-08-23 17:31:21 INFO: Loading: ner
2023-08-23 17:31:22 INFO: Done loading processors!


2023-08-23 17:31:26 Finished file: /Users/lootes/Documenten/Projects/Transparency_in_the_Dutch_non-profit_sector/np-transparency/Tutorials/Data/Jaarverslag_Bedrijf.pdf
in opdp
The start time was:  2023-08-23 17:31:21
The end time is:  2023-08-23 17:31:26


### Extracted people

In [None]:
# Check out the resulting dataframe with info on mentioned people:
df_p

In [28]:
# show lists of individual column contents
print('the mentioned people are:')
df_p['Persons'].to_list()[0].split("\n")[:-1]

# ambassadors
print('the ambassadors are:')
df_p['Ambassadors'].to_list()[0].split("\n")[:-1]

# board members
print('the board members are:')
df_p['Board_members'].to_list()[0].split("\n")[:-1]

#job descriptions 
print('The are the job positions:')
df_p['Job_description'].to_list()[0].split("\n")[:-1]

['A.B. de Wit',
 'Anna de Wit',
 'Bernard Zwartjes',
 'Cornelis Geel',
 'D.A. Rooden',
 'Dirkje Rooden',
 'E. van Grijs',
 'Eduard van Grijs',
 'F. de Blauw',
 'Ferdinand de Blauw',
 'G. Roze',
 'Gerard Roze',
 'H. Doe',
 'Hendrik Doe',
 'Hendrik Groen',
 'Isaak Paars',
 'J. Doe',
 'Jan van Oranje',
 'Jane Doe',
 'Karel',
 'Lodewijk',
 'Maria',
 'Mohammed El Idrissi',
 'Mr. H. Hendrik Groen',
 'Nico',
 'Otto',
 'Pieter',
 'Sarah',
 'Saïda Benali',
 'Thomas']

### Extracted organisations

In [32]:
df_o

Unnamed: 0,Input_file,mentioned_organization,n_mentions
0,Jaarverslag_Bedrijf.pdf,ABCbank,1
1,Jaarverslag_Bedrijf.pdf,Bedrijf2,1
2,Jaarverslag_Bedrijf.pdf,FGH,1
3,Jaarverslag_Bedrijf.pdf,Firma Accountancy,1
4,Jaarverslag_Bedrijf.pdf,L9PA Foundation,1
5,Jaarverslag_Bedrijf.pdf,NL00ABCB00012345678,1
6,Jaarverslag_Bedrijf.pdf,Stichting Non-Profit,1
7,Jaarverslag_Bedrijf.pdf,Universiteit Opleindscentrum,1


### Extracting more data from a directory

Data can also be extracted from a directory containing pdf files. In that case instead of `run_nedextract.run(file=file, tasks=tasks)`, run:

`run_nedextract.run(dir=my_Directory, tasks=tasks)`

### Extract data from a url link or multiple links
You can also run a pdf file from a url location so that you don't have to save your data locally with:

`run_nedextract.run(url=my_url, tasks=tasks)`

A number of url's to online pdf files can also be used if the links saved in a text file. Create a text file "urls.txt":<br>
myurl1<br>
myurl2<br>
myurl3<br>
...<br>
<br>
and run:<br>

`run_nedextract.run(urlf=urls.txt, tasks=tasks)`

### Save the output
Run the run function with the `write_output=True` argument to save the output to excel files.

### Classifying sectors

In order to classify the organisation, a pretrained model has to be provided. You can pretrain your model with your own training data using the `extract_pdf.classofy_organisation.train` function

In [38]:
help(classify_organisation.train)

Help on function train in module nedextract.classify_organisation:

train(data: pandas.core.frame.DataFrame, train_size: float, alpha: float, save: bool = False)
    Train a MultinomialNB classifier to classify texts into the main sector categories.
    
    This function trains a Multinomial Naive Bayes classifier to classify text data into
    main sector categories. It uses the given dataset ('data') containing 'text' and 'Sector'
    columns. The 'text' column contains the textual data, and the 'Sector' column represents
    the main sector categories that the texts belong to.
    
    The function performs the following steps:
    1. Factorizes the 'Sector' column, which contains the labels, to convert categories into numerical labels.
    2. Splits the data into training and testing sets based on the specified 'train_size'.
    3. Applies Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to the
       training data to transform text features into numerical vectors.