In [57]:
import pandas as pd
import PyPDF2

from IPython.display import clear_output

import os
import re
import json

In [12]:
df = pd.read_csv('..\\data\\ocred\\files_df.csv', index_col = 0)

## Reason for request



One of the most important things to determine about a Wob request is what was actually requested. Fortunately, when the government agency that received the request completes it, it issues a decision document alongside the requested documents. This decision document summarizes in one sentence what was requested, and that summary is what needs to be extracted. The summary is usually introduced by one or more keywords, which come down to Dutch versions of "requested", "information about", or "publication of". The following Dutch keywords are used:
- verzocht
- u verzoekt
- om informatie over
- uw verzoek ziet
- om openbaarmaking van
- more to come
When one of these keywords is found, all following text is extracted up to the next period. To do this, a regular expression was used:

keyword([^.]+?)\\.

Here, keyword is one of the words or phrases listed above. The expression first finds one of these keywords and then matches any character other than a period, up to the first period. Before the regular expression can be used, however, the text needs some preprocessing. First, excessive newlines are removed. Second, all letters are converted to lowercase, so that the regular expression matches the keywords even when they are capitalized in the text; this is necessary because regular expressions are case-sensitive by default. Last, for any word or abbreviation that contains a period that does not mark the end of a sentence, that period has to be removed; otherwise it would cut the match short. After this, the necessary information can be extracted.
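The preprocessing and matching steps described above can be sketched as follows. This is a minimal illustration, not the exact code used in the experiments: the helper names `preprocess` and `extract_requests` are hypothetical, and the sample sentence is a shortened, made-up variant of a decision document opening.

```python
import re

# Keyword list taken from the text above; helper names are hypothetical.
KEYWORDS = [
    "verzocht",
    "u verzoekt",
    "om informatie over",
    "uw verzoek ziet",
    "om openbaarmaking van",
]


def preprocess(text):
    """Collapse whitespace, lowercase, and drop periods that are not sentence-final."""
    text = re.sub(r"\s+", " ", text).lower()
    words = []
    for word in text.split(" "):
        # A single trailing period is treated as the end of a sentence.
        if word.count(".") == 1 and word.endswith("."):
            words.append(word)
        else:
            words.append(word.replace(".", " "))
    return " ".join(words)


def extract_requests(text):
    """Return the text between each keyword and the next period."""
    clean = preprocess(text)
    matches = []
    for keyword in KEYWORDS:
        pattern = re.escape(keyword) + r"([^.]+?)\."
        matches += [m.strip() for m in re.findall(pattern, clean)]
    return matches


# Illustrative input only, not taken from the dataset.
sample = "Op 10 juni 2020 heeft u\nVerzocht om een overzicht van de agenda-afspraken\nvan de minister."
print(extract_requests(sample))
```

Note that an abbreviation ending in a single period (for example "art.") is still treated as a sentence end by this rule, so it can cut a match short.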

After the regular expressions have been run, the matches need to be checked for duplicates. Duplicates arise when the decision document states, for example, "the request is for information about [...]". In this case the same sentence is matched by both the keyword "request" and the keyword "information about", so only one of the matches is kept and the other is removed.
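A minimal sketch of this duplicate check is given below. It assumes that a match is redundant when it is a substring of another match, and that of several identical matches only the first should be kept; the function name `drop_contained_matches` and the example strings are hypothetical.

```python
def drop_contained_matches(matches):
    """Keep only matches that are not contained in another match."""
    unique = []
    for i, candidate in enumerate(matches):
        contained = False
        for j, other in enumerate(matches):
            if i == j:
                continue
            if candidate == other:
                # Exact duplicates: keep only the first occurrence.
                if j < i:
                    contained = True
                    break
                continue
            if candidate in other:
                contained = True
                break
        if not contained:
            unique.append(candidate)
    return unique


# Hypothetical matches: the shorter ones are contained in the longer one.
matches = [
    " is voor informatie over de agenda",
    " de agenda",
    " de agenda",
]
print(drop_contained_matches(matches))
```

Handling identical matches separately matters: a plain substring test would treat two identical matches as contained in each other and discard both.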

To evaluate the extractor, a different method than for the other extractors was used: the panoptic quality metric. With this metric, precision, recall, and F1 scores are still calculated, but in a different way. It uses the overlap between the ground truth and the prediction to measure the performance of a model. First, the intersection and union of the ground truth $G$ and the prediction $P$ are calculated. The intersection is the overlap between the two and the union is the combination of the two. From these, the Intersection over Union (IoU) metric is calculated as $\mathrm{IoU} = |G \cap P| / |G \cup P|$. The IoU is then compared to a threshold value. If the IoU is higher than the threshold, the prediction counts as a true positive; if it is lower, it counts as both a false negative and a false positive. Precision, recall, and F1 score can then be calculated.
https://medium.com/@danielmechea/panoptic-segmentation-the-panoptic-quality-metric-d69a6c3ace30
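The evaluation described above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code itself: the IoU is computed over sets of lowercase words (the text does not specify whether words or characters are compared), the threshold of 0.5 is an assumed default, there is exactly one prediction per ground truth, and the function names `iou` and `score` are hypothetical.

```python
def iou(ground_truth, prediction):
    """Intersection over Union of the word sets of two strings."""
    truth_tokens = set(ground_truth.lower().split())
    pred_tokens = set(prediction.lower().split())
    union = truth_tokens | pred_tokens
    if not union:
        return 0.0
    return len(truth_tokens & pred_tokens) / len(union)


def score(pairs, threshold=0.5):
    """Precision, recall and F1 from (ground_truth, prediction) pairs."""
    tp = fp = fn = 0
    for ground_truth, prediction in pairs:
        if iou(ground_truth, prediction) > threshold:
            tp += 1
        else:
            # A miss counts as both a false positive and a false negative.
            fp += 1
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical pairs: one exact match and one complete miss.
pairs = [
    ("documenten over adopties uit haïti", "documenten over adopties uit haïti"),
    ("overzicht van de agenda-afspraken", "een besluit"),
]
print(score(pairs))
```

With one prediction per ground truth, every miss adds to both error counts, so precision and recall are equal in this setup.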

## Results

One problem this approach is sensitive to is mistakes made in the OCR process. The regular expressions look for certain keywords and, once one is found, for the end of the sentence. If the OCR process made a mistake in one of the keywords, or if the period at the end of the sentence was not recognized as a period, the regular expression will not match even though it should.

In [6]:
def removeAbbreviation(text):
    """Replace periods that do not end a sentence with spaces."""
    words = []
    for word in text.split(' '):
        # Keep words without a period, or with a single sentence-final period.
        if word.count('.') == 0 or (word.count('.') == 1 and word[-1] == '.'):
            words.append(word)
        else:
            # Any other period is assumed to belong to an abbreviation.
            words.append(word.replace('.', ' '))
    # A trailing space is kept to match the word-by-word concatenation.
    return ' '.join(words) + ' '


In [4]:
def eval():
    test = df[(df.full_name.str.contains('_besluit') & (df.page == 1))]

    for t in test.full_name:
        rawText = df[df.full_name == t].text.values[0]
        text = rawText.replace('\n', ' ')
        text = text.lower()
        text = removeAbbreviation(text)

        # Raw strings avoid invalid-escape warnings for '\.'.
        patterns = [r'verzocht([^.]+?)\.', r'u verzoekt([^.]+?)\.', r'om informatie over([^.]+?)\.', r'uw verzoek ziet([^.]+?)\.', r'om openbaarmaking van([^.]+?)\.']
        matches = []

        for pattern in patterns:
            matches += re.findall(pattern, text)

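        # Drop matches that are contained in another match or repeat an earlier one.
        uniqueMatches = []

        for i in range(len(matches)):
            for j in range(len(matches)):
                if i == j:
                    continue
                if matches[i] == matches[j]:
                    # Identical matches: keep only the first occurrence.
                    if j < i:
                        break
                    continue
                if matches[i] in matches[j]:
                    break
            else:
                uniqueMatches.append(matches[i])

        print(uniqueMatches)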
        print(rawText)
        if input() == 'q':
            break
        clear_output()

[' om een overzicht van de agenda-afspraken van de minister, alsmede beide staatssecretarissen en alle leden van de bestuursraad van het ministerie van financiën']
Ministerie van Financiën

 

> Retouradres Postbus 20201 2500 EE Den Haag

Directie Juridische Zaken

Korte Voorhout 7
2511 CW Den Haag
Postbus 20201

2500 EE Den Haag
www.rijksoverheid.nl

Ons kenmerk
2021-0000046938

Uw brief (kenmerk)

Datum 25 maart 2021
Betreft Besluit op uw Wob-verzoek inzake agenda-afspraken
Geachte

In uw brief van 10 juni 2020, ontvangen op 15 juni 2020, heeft u

met een beroep op de Wet openbaarheid van bestuur (hierna: Wob)

verzocht om een overzicht van de agenda-afspraken van de minister, alsmede
beide staatssecretarissen en alle leden van de bestuursraad van het ministerie van
Financiën.

De ontvangst van uw verzoek heb ik u schriftelijk bevestigd bij brief van 23 juni
2020, met kenmerk 2020-0000118052. In deze brief is tevens de beslistermijn met
vier weken verdaagd.

U heeft uw Wob-verzoek ev

In [44]:
baseDFPath = '..\\data\\openstate data\\'
basePDFPath = 'F:\\Data files\\Master thesis\\verzoeken\\'

mins = ['lnv','az','buza','bzk','ezk','fin','ienw','jenv','ocw','szw','vws']


for ministrie in mins:
    for f in os.listdir(baseDFPath):
        if ministrie in f.lower():
            file = f
            break
    for d in os.listdir(basePDFPath):
        if ministrie in d.lower():
            dir = d
            break
    
    data = pd.read_excel(baseDFPath + file)
    data = data[['WOB Verzoek', 'Soort aanvraag']]
    data = data.dropna(subset=['Soort aanvraag'])

    for i in range(10):
        sample = data.sample(1)
        requestNr = sample['WOB Verzoek'].values[0]
        description = sample['Soort aanvraag'].values[0]

        



In [55]:
def getRandomDecisionDoc(ministrie):

    baseDFPath = '..\\data\\openstate data\\'
    basePDFPath = 'F:\\Data files\\Master thesis\\verzoeken\\'

    for f in os.listdir(baseDFPath):
        if ministrie in f.lower():
            file = f
            break
    for d in os.listdir(basePDFPath):
        if ministrie in d.lower():
            dir = d + '\\'
            break

    testDf = pd.read_excel(baseDFPath + file)
    testDf = testDf[['WOB Verzoek', 'Soort aanvraag']]
    testDf = testDf.dropna(subset=['Soort aanvraag'])
    s = testDf.sample(1, random_state=1)

    requestNr = s['WOB Verzoek'].values[0]
    description = s['Soort aanvraag'].values[0]

    for p in os.listdir(basePDFPath + dir + str(float(requestNr))):
        if 'besluit' in p:
            pdfPath = p
            break

    return description, basePDFPath + dir + str(float(requestNr)) + '\\' + pdfPath


In [52]:
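def textExtract(pdfPath):
    # Return the text of the first page of the PDF at pdfPath.
    # PdfFileReader/getPage/extractText were removed in PyPDF2 3.0;
    # PdfReader, .pages and extract_text are their replacements.
    with open(pdfPath, 'rb') as f:
        pdfReader = PyPDF2.PdfReader(f)
        text = pdfReader.pages[0].extract_text()
    return text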



In [58]:
groundTruth, pdfPath = getRandomDecisionDoc('bzk')
print(groundTruth, pdfPath)
text = textExtract(pdfPath)

print(text)

Verzoek om documenten over adopties uit Haïti in de periode 1980 – 1995 F:\Data files\Master thesis\verzoeken\WOB-verzoeken BZK\50.0\Bes-Aanvullend+WOB+besluit+adopties+Haiti+1980+-+1995.pdf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Dat um
 
01 fe bruari 2021
 
 
Betreft
:
 
Aanvullend be sluit
 
Wob verzoek adopties Haiti
 
 
 
Pagina
 1 
van
 6  
 
Direc tie Co n su la ire Za ken  en 
Visumbe le id
 
C onsulaire  A ange legemhede n
 
Postb us 20061
 
2500 EB De n Haag
  
Nederlan d
 
www.rijksove rhe id.n l
 
Con ta ctp ersoon
 
DCV
-
CA
 
T
 
 
E
 
 
 
On ze referentie
 
 
Kopie aan
 
 
Bijla ge (n)
 
 
 
 
 
 
 
 
Ge acht e
 
me vrouw ,
 
 
In aanvulling op mijn eerdere besluit van 16 oktober 2020 
inzake uw 
beroep op de W et openbaarheid van bestuurd (hierna: W ob) namens  
, 
bericht
 
ik u als volgt. Voor een omschrijving van het 
W ob
-
verzoek en het procesverloop zij verwezen naar het besluit van 16 
oktober 2020.
 
 
Uw bezwaar tegen het besluit van 16 oktober 2020 ziet op grond va




indicators

gevraagd om openbaarmaking
U geeft aan graag meer inzicht te krijgen
verzoek ingediend om openbaarmaking van
Uw verzoek heeft betrekking

