## Demo: Information Extraction from Administrative Decisions

This notebook contains a demo of information extraction from administrative decisions (enforcement decisions, administrative fines or penalties). The project is explained in the thesis 'Combining rule-based and machine learning methods for effective information extraction on administrative decisions'. GitHub link: https://github.com/Harry-Nan/IE-administrative-decisions/

### Data
The data is obtained from [Woogle](https://woogle.wooverheid.nl/search?q=*). This project uses decisions from two governing bodies: the [Kansspelautoriteit](https://kansspelautoriteit.nl/) (Dutch gambling authority, KSA) and the [Autoriteit Financiële Markten](https://afm.nl/) (Dutch financial markets authority, AFM). For this demo, information is extracted from one KSA administrative fine ([download pdf](https://pid.wooverheid.nl/?pid=nl.ab5.2k.2014.1.bijlage.1)). The following code imports this data, obtained from Woogle, into a DataFrame.

In [1]:
# import KSA-decision
import pandas as pd

all_ksa_decisions = pd.read_csv("D:\\Documents\\UvA\\CITaDOG\\ksa_1.csv").set_index('foi_documentId')
decision = all_ksa_decisions.loc['nl.ab5.2k.2014.1.bijlage.1']

# Display data
display(decision[['foi_bodyTextOCR']].head(5))

# Display shape of DataFrame
print(decision.shape)

foi_documentId,foi_bodyTextOCR
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Beslui...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 nov...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 nov...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 ...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 ...


(14, 54)


### Data preprocessing: Remove header/footer
Headers and footers are removed from the data, along with excessive whitespace. This code assumes the data is obtained per page. To do this, spaCy's pre-trained pipeline 'nl_core_news_lg' is imported and used.

In [2]:
# Import and load nlp-model
from scripts.spacy_model import spacy_nlp
nlp = spacy_nlp()

Successfully loaded nlp-model.


In [3]:
# Import scripts
from scripts.clean_text import run_clean_text

# Remove headers/footers, excessive spacing
decision = run_clean_text(decision)

display(decision[['foi_bodyTextOCR', 'cleaned_text']].head(5))
print(decision.shape)

foi_documentId,foi_bodyTextOCR,cleaned_text
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Beslui...,Besluit van de Raad van Bestuur van de Kansspe...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 nov...,Inhoudsopgave Samenvatting. … sss sens veeenen...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 nov...,Inleiding 1. De Kansspelautoriteit is belast m...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 ...,Onderzoek winkel Star SAT Electronica 2.1 Het ...
nl.ab5.2k.2014.1.bijlage.1,OPENBAAR Kansspelautoriteit Datum 21 ...,"bedoeld in artikel 30b, eerste lid, aanhef en ..."


(14, 57)
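
The idea behind the header removal can be illustrated with a minimal sketch. It is not the implementation in `scripts/clean_text.py`: the function name `strip_shared_header` is hypothetical, and the sketch simply assumes that a header is whatever run of leading tokens all pages have in common.

```python
import re


def strip_shared_header(pages):
    """Remove the leading tokens shared by every page and collapse whitespace.

    Hypothetical helper for illustration only; the real cleaning script
    may use a different strategy (and also handles footers).
    """
    # Collapse runs of whitespace, then split each page into tokens.
    tokenized = [re.sub(r"\s+", " ", page).strip().split(" ") for page in pages]

    # With fewer than two pages there is nothing to compare against.
    if len(tokenized) < 2:
        return [" ".join(tokens) for tokens in tokenized]

    # Shrink the candidate header to the prefix shared with each page.
    prefix = tokenized[0]
    for tokens in tokenized[1:]:
        shared = 0
        for left, right in zip(prefix, tokens):
            if left != right:
                break
            shared += 1
        prefix = prefix[:shared]

    return [" ".join(tokens[len(prefix):]) for tokens in tokenized]


sample_pages = [
    "OPENBAAR Kansspelautoriteit  Besluit van de Raad",
    "OPENBAAR Kansspelautoriteit Datum 21 november Inleiding",
]
print(strip_shared_header(sample_pages))
```

Note that a header which varies between pages (here the `Datum ...` part, which is missing on the first page) survives this simple approach, which is one reason a more elaborate cleaning step is needed in practice.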


### Selection / categorization of data

As explained in Section 3.1 of the thesis, a selection of administrative fines and penalties is made. The following script checks whether a document is an administrative fine or an administrative penalty.

In [4]:
from scripts.categorization import hasFilterWords, removeAdviceRevelation, convertToSents, identifyLegalEffect

# Put text into a list for easier access
list_of_pages = list(decision['cleaned_text'])

# Go through all steps identified in 3.1:

# Step 1: keyword extraction
step1 = hasFilterWords(list_of_pages, category='sanctie') # category can be sanctie (fine) or dwangsom (penalty)
if step1:
    print('Step 1 completed: doc contains keyword(s).')
    
    # Step 2: remove advices
    list_of_pages = removeAdviceRevelation(list_of_pages)
    if list_of_pages:
        print('Step 2 completed: non-advice text found.')
        
        # Step 3: identify Legal Effect
        list_of_sentences = convertToSents(list_of_pages) # Necessary step to identify Legal Effect
        possible_legal_effects = identifyLegalEffect(list_of_sentences)
        
        if possible_legal_effects:
            print('Step 3 completed: possible legal effect(s) found:', possible_legal_effects[0])
            print('Document is an administrative fine.')
        else:
            print('Step 3 failed: no legal effect identified.')        
    else:
        print('Step 2 failed: document is an advice.')   
else:
    print('Step 1 failed: doc contains no keyword(s)')

Successfully loaded nlp-model.
Successfully loaded nlp-model.
Step 1 completed: doc contains keyword(s).
Step 2 completed: non-advice text found.
Step 3 completed: possible legal effect(s) found: ['€ 20.000', '€ 20.000', '€ 20.000.']
Document is an administrative fine.
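
Step 1 can be pictured as a plain keyword filter. The sketch below is an assumption about what such a filter might look like, not the code behind `hasFilterWords`: the function name `has_filter_words` and the keyword lists are illustrative placeholders.

```python
# Illustrative keyword lists (assumption); the lists used in
# scripts/categorization.py may differ.
FILTER_WORDS = {
    "sanctie": ["bestuurlijke boete", "boete"],
    "dwangsom": ["last onder dwangsom", "dwangsom"],
}


def has_filter_words(pages, category="sanctie"):
    """Return True if any page mentions a keyword for the given category."""
    keywords = FILTER_WORDS[category]
    return any(keyword in page.lower() for page in pages for keyword in keywords)


pages = ["De Raad van Bestuur legt een bestuurlijke boete op van € 20.000."]
print(has_filter_words(pages, category="sanctie"))
print(has_filter_words(pages, category="dwangsom"))
```

A filter like this only narrows down the candidates, which is why the later steps (removing advice and identifying a legal effect) are still required.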


Afterwards, objection to appeal decisions are removed. Disclosure decisions were already removed from the text in step 2 of the code above. Objection to appeal decisions are removed as follows:

In [5]:
from scripts.categorization import isPrimair

isprimair = isPrimair(list_of_pages) # removes text that is identified to be disclosure/objection to appeal decision

if isprimair:
    print('Document is not an objection to appeal decision.')
else:
    print('Document is an objection to appeal decision.')

Document is not an objection to appeal decision.


### 3.2.1 Extraction of Date

The following steps are taken to extract the date, as explained in the thesis:
1. Matching on date-patterns.
2. Presence of date in header/footer. 
3. Keyword matching on 'Datum' or Dutch city. [External data](https://metatopos.dijkewijk.nl/) that contains all Dutch cities is used.

In [6]:
from scripts.extract_date import extract_date

# Extract date
date = extract_date(decision.loc['nl.ab5.2k.2014.1.bijlage.1'])[1]
print(date[0])

#df['date'] = date

21/11/2013
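
Step 1 (matching on date patterns) might look roughly like the sketch below. It covers only written-out Dutch dates such as "21 november 2013"; the function name `find_dutch_dates` and the sample sentence are made up for illustration, and `scripts/extract_date.py` may match other formats as well.

```python
import re
from datetime import date

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Day, written-out month, four-digit year, e.g. "21 november 2013".
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(DUTCH_MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def find_dutch_dates(text):
    """Return all valid dates found in the text, formatted as dd/mm/yyyy."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            parsed = date(int(year), DUTCH_MONTHS[month.lower()], int(day))
        except ValueError:
            # Skip impossible dates such as "31 februari 2013".
            continue
        found.append(parsed.strftime("%d/%m/%Y"))
    return found


print(find_dutch_dates("OPENBAAR Kansspelautoriteit Datum 21 november 2013"))
```

Pattern matching alone usually yields several candidate dates per document, which is where steps 2 and 3 come in to pick the decision date.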


### 3.2.2 Extraction of Violated Article and Legal Basis

The following steps are taken to extract the Violated Article and Legal Basis, as explained in the thesis:

1. Keyword matching ('artikel').
2. POS-tagging to obtain full article.
3. Context-aware matching (keywords).

In [7]:
# Convert to sentences
from scripts.text_to_sents import convertToSents
list_of_sentences = convertToSents(list_of_pages)

# Extract Violated Article / Legal Basis
from scripts.extract_article import find_articles
articles, articles_idxs = find_articles(list_of_sentences, nlp)

print(articles[:3])

Successfully loaded nlp-model.
['artikel 35a van de Wet op de kansspelen', 'artikel 3, boek 2 van het Burgerlijk Wetboek).', 'artikel 5:48 van de Algemene wet bestuursrecht.']
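
As a simplified illustration of step 1, the sketch below finds article references with a regular expression and records the index of the sentence each one came from. It deliberately leaves out the POS-tagging step, so it captures only the article number and not the full reference including the law's name. The function name `find_article_mentions` is hypothetical and is not the `find_articles` function used above.

```python
import re

# 'artikel' followed by a number, optionally with a ':' or '.' part and a
# trailing letter, e.g. "artikel 35a", "artikel 5:46", "artikel 3.4".
ARTICLE_RE = re.compile(r"\bartikel\s+\d+(?:[:.]\d+)?[a-z]?\b", re.IGNORECASE)


def find_article_mentions(sentences):
    """Return (mentions, sentence_indices) for every article reference found."""
    mentions, idxs = [], []
    for i, sentence in enumerate(sentences):
        for match in ARTICLE_RE.finditer(sentence):
            mentions.append(match.group(0))
            idxs.append(i)
    return mentions, idxs


sentences = [
    "De Raad is ingevolge artikel 35a van de Wok bevoegd een boete op te leggen.",
    "Bij de vaststelling van de boete houdt de Raad rekening met de ernst.",
    "Zie artikel 5:46 van de Algemene wet bestuursrecht.",
]
print(find_article_mentions(sentences))
```

Keeping the sentence indices alongside the matches is what allows the surrounding sentences to be retrieved later on.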


### 3.2.3 Extraction of Legal Effect

The following steps are taken to extract the legal effect, as explained in the thesis:

1. Matching on money-patterns.
2. POS-tagging to obtain associated noun.
3. Keyword matching.

In [8]:
# Extract legal Effect(s)
from scripts.extract_legal_effect import extract_legal_effect
legal_effect_all = extract_legal_effect(list_of_sentences)
legal_effect = legal_effect_all[0]
legal_effect_idx = legal_effect_all[1]

print('Identified legal effect(s):', legal_effect)

Successfully loaded nlp-model.
Identified legal effect(s): ['€ 20.000', '€ 20.000', '€ 20.000.']
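
Step 1 (matching on money patterns) can be sketched with a regular expression for euro amounts written in Dutch notation, where a dot separates thousands. This is an illustration under stated assumptions rather than the code in `scripts/extract_legal_effect.py`: the name `parse_euro_amounts` is hypothetical, and any cents after a decimal comma are ignored.

```python
import re

# A euro sign followed by either a dot-grouped amount ("20.000") or plain
# digits ("750"). Cents after a decimal comma are not captured.
MONEY_RE = re.compile(r"€\s*(\d{1,3}(?:\.\d{3})+|\d+)")


def parse_euro_amounts(text):
    """Return all euro amounts in the text as whole-euro integers."""
    return [int(amount.replace(".", "")) for amount in MONEY_RE.findall(text)]


print(parse_euro_amounts("De Raad legt een boete op van € 20.000."))
```

The trailing full stop in a match such as '€ 20.000.' is left out here because the thousands group requires exactly three digits after each dot.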


### Obtain sentences based on 3.2.2 and 3.2.3

In [9]:
# Obtain sentences based on identified idxs
from scripts.candidateSentences import selectCandidateSentences
candidate_sents = selectCandidateSentences(legal_effect_idx, articles_idxs, list_of_sentences)

# Print the extracted candidate sentence sections from index 2 onwards
print(candidate_sents[2:])

['Hij is verantwoordelijk voor de gang van zaken in de winkel en als eigenaar van de eenmanszaak aansprakelijk voor de gehele bedrijfsvoering. De aanwezigheid van de gokzuil in de winkel is aan hem toe te rekenen. 39, Derhalve is de eigenaar aan te merken als overtreder van artikel 30t, eerste lid, aanhef en onder c, van de Wok. 21 Hof Arnhem-Leeuwarden,', 'Inleiding 40. 41. 42. 43. 44, De Raad van Bestuur van de Kansspelautoriteit is ingevolge artikel 35a van de Wok bevoegd een boete op te leggen van ten hoogste het bedrag van de zesde categorie (artikel 23 van het Wetboek van Strafrecht) of - indien dit meer is — 10% van de omzet in het boekjaar voorafgaand aan de beschikking. Bij de vaststelling van de boete houdt de Raad rekening met de ernst van de overtreding en de mate waarin deze aan de overtreder kan worden verweten. Zo nodig houdt de Raad rekening met de omstandigheden waaronder de overtreding is gepleegd (artikel 5:46 van de Algemene wet bestuursrecht). Ingevolge artikel 3.4
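
One plausible way to turn sentence indices into candidate sections is to keep each hit together with a small window of neighbouring sentences and to merge overlapping windows. The sketch below assumes that approach; the function name `select_candidate_sections`, its single combined index list, and the `window` parameter are illustrative and do not mirror the signature of `selectCandidateSentences`.

```python
def select_candidate_sections(hit_idxs, sentences, window=1):
    """Group each hit sentence with its neighbours into joined text sections."""
    # Collect every index that lies within `window` sentences of a hit.
    keep = set()
    for i in hit_idxs:
        start = max(0, i - window)
        stop = min(len(sentences), i + window + 1)
        keep.update(range(start, stop))

    # Join runs of consecutive indices into one section each.
    sections, current, previous = [], [], None
    for j in sorted(keep):
        if previous is not None and j != previous + 1:
            sections.append(" ".join(current))
            current = []
        current.append(sentences[j])
        previous = j
    if current:
        sections.append(" ".join(current))
    return sections


sentences = ["a", "b", "c", "d", "e", "f", "g"]
print(select_candidate_sections([1, 5], sentences, window=1))
```

Sending a few compact sections instead of the whole document keeps the prompt short while preserving the context around each match.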

### Analyse sentences using ChatGPT (3.3) and extract information

The following code sends the candidate sentences to ChatGPT, using the prompt explained in the thesis, and obtains its response.

Note: an API key from OpenAI is required to run this code.

In [10]:
# Analyse sentences using ChatGPT, and extract information
from scripts.askGPT import askGPT, extractInformation
response = askGPT(candidate_sents, category='sanctie')

In [11]:
df = extractInformation(response)

# include date (that was previously extracted)
df['date'] = date

df

Unnamed: 0,legal effect,ontvanger,overtreden_artikel,type of activity,dma,legal basis,date
0,[20000],[eigenaar van de elektronicawinkel Star SAT El...,"[[artikel 30t, eerste lid, aanhef en onder c, ...",Het aanwezig hebben van een speelautomaat (gok...,Raad van Bestuur van de Kansspelautoriteit,artikel 35a van de Wet op de kansspelen,21/11/2013
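
How a model reply is turned into structured fields depends on the prompt, which is described in the thesis. Purely as an illustration, the sketch below assumes the model was asked to answer with a JSON object whose keys match the columns shown above; that response format and the function name `parse_response` are assumptions, not a description of `extractInformation`.

```python
import json
import re

# Field names taken from the columns of the DataFrame shown above.
FIELDS = [
    "legal effect", "ontvanger", "overtreden_artikel",
    "type of activity", "dma", "legal basis",
]


def parse_response(response_text):
    """Extract the expected fields from a reply that contains a JSON object.

    Missing fields, or a reply without valid JSON, yield None values.
    """
    empty = {field: None for field in FIELDS}

    # The model may wrap the JSON in extra text, so grab the outermost braces.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    return {field: data.get(field) for field in FIELDS}


reply = 'Result: {"legal effect": [20000], "dma": "Raad van Bestuur van de Kansspelautoriteit"}'
print(parse_response(reply))
```

Falling back to empty values rather than raising an error keeps a batch run going when a single reply is malformed.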
