In [1]:
import pandas as pd
import PyPDF2

from IPython.display import clear_output

import os
import re
import json

In [2]:
df = pd.read_csv('..\\data\\ocred\\files_df.csv', index_col = 0)

## Reason for request



One of the most important things to establish about a Wob request is what was actually requested. Fortunately, when the government agency that received the request completes it, it issues a decision document alongside the requested documents. This decision document summarizes the request in a single sentence, and that summary is what needs to be extracted. The summary is usually introduced by one or more keywords, which come down to Dutch versions of "requested", "information about", or "publication of". The Dutch keywords used are:
- verzocht
- u verzoekt
- om informatie over
- uw verzoek ziet
- om openbaarmaking van
- more to come
When one of these keywords is found, all text that follows it is extracted up to the next period. This is done with a regular expression:

keyword([^.]+?)\\.

Here, keyword is one of the words or phrases listed above. The expression first finds the keyword and then matches every character that is not a period, up to the first period. Before the regular expression can be used, however, the text needs some preprocessing. First, excessive newlines are removed. Second, all letters are converted to lowercase, so that the regular expression matches the keywords even when they are capitalized in the text; this is necessary because regular expressions are case sensitive by default. Last, any period inside a word or abbreviation that does not mark the end of a sentence has to be removed, otherwise it would cut the match short. After this, the necessary information can be extracted.
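
As a minimal sketch of the lowercasing, newline handling, and matching steps, the snippet below applies the pattern for a single keyword to a short invented sentence. The helper name `extract_after_keyword` and the sample text are illustrative assumptions and not part of the notebook; abbreviation handling is left out here.

```python
import re

def extract_after_keyword(text, keyword):
    # collapse runs of whitespace (including newlines) and lowercase the text
    cleaned = re.sub(r'\s+', ' ', text).lower()
    # capture everything after the keyword up to the first period
    pattern = re.escape(keyword) + r'([^.]+?)\.'
    return re.findall(pattern, cleaned)

# invented sample sentence, not taken from a real decision document
sample = 'U heeft verzocht om\ninformatie over de Wet Arbeid Vreemdelingen. Hierbij mijn besluit.'
print(extract_after_keyword(sample, 'verzocht'))
```

Without the lowercasing step, a capitalized keyword at the start of a sentence would be missed.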

After the regular expressions have been run, the matches need to be checked for duplicates. Duplicates occur when the decision document states, for example, "the request is for information about [...]". In this case the same sentence is matched by both the keyword "request" and the keyword "information about", so only one of the two matches is kept and the other is removed.
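
The duplicate check can be sketched as follows, assuming that a match counts as a duplicate when it is contained in another match. The function name `drop_contained` and the two sample matches are hypothetical.

```python
def drop_contained(matches):
    # keep a match only if it is not contained in another match;
    # of identical matches, only the first one is kept
    unique = []
    for i, candidate in enumerate(matches):
        contained = any(
            candidate in other and (candidate != other or j < i)
            for j, other in enumerate(matches)
            if j != i
        )
        if not contained:
            unique.append(candidate)
    return unique

matches = [
    ' om informatie over de wet arbeid vreemdelingen',  # found via "verzocht"
    ' de wet arbeid vreemdelingen',                     # found via "om informatie over"
]
print(drop_contained(matches))
```

In this sketch the longer match survives, because the shorter one is a substring of it.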

To evaluate the extractor, a different method than for the other extractors was used: the panoptic quality metric. With this metric, precision, recall, and F1 scores are still calculated, but in a different way: the overlap between the ground truth and the prediction is used to measure the performance of a model. First, the intersection and union of the ground truth and the prediction are calculated. The intersection is the overlap between the two and the union is the combination of the two. From these, the Intersection over Union (IoU) is calculated as $\mathrm{IoU} = |G \cap P| / |G \cup P|$, where $G$ is the ground truth and $P$ the prediction. The IoU is then compared to a threshold value. If the IoU is higher than the threshold, the prediction counts as a true positive; otherwise it counts as both a false negative and a false positive. Precision, recall, and F1 score can then be calculated.
https://medium.com/@danielmechea/panoptic-segmentation-the-panoptic-quality-metric-d69a6c3ace30
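
A rough sketch of this evaluation is given below. It rests on several assumptions that the text does not fix: overlap is measured on sets of lowercase word tokens, each document contributes exactly one (ground truth, prediction) pair, and the threshold defaults to 0.5. The names `iou` and `score` and the sample pairs are hypothetical.

```python
def iou(ground_truth, prediction):
    # assumption: overlap is measured on sets of lowercase word tokens
    truth_tokens = set(ground_truth.lower().split())
    pred_tokens = set(prediction.lower().split())
    union = truth_tokens | pred_tokens
    if not union:
        return 0.0
    return len(truth_tokens & pred_tokens) / len(union)

def score(pairs, threshold=0.5):
    # pairs: list of (ground_truth, prediction) strings, one per document
    tp = fp = fn = 0
    for ground_truth, prediction in pairs:
        if iou(ground_truth, prediction) > threshold:
            tp += 1
        else:
            # a miss counts as both a false positive and a false negative
            fp += 1
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# invented example pairs: one exact hit and one complete miss
pairs = [
    ('informatie over de wet arbeid vreemdelingen', 'informatie over de wet arbeid vreemdelingen'),
    ('documenten over stikstof', 'bewijs te leveren dat u bevoegd bent'),
]
print(score(pairs))
```

Because every miss is counted on both sides, precision and recall coincide in this sketch whenever each document has exactly one prediction.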

## Results

One problem the approach is sensitive to is mistakes made in the OCR process. The regular expressions look for certain keywords and, once one is found, for the end of the sentence. If the OCR process made a mistake in one of the keywords, or if the period at the end of the sentence was not recognized as a period, the regular expression will not match even though it should.
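
Both failure modes can be reproduced on small invented strings. The snippet below is only an illustration; the misread characters are made-up examples of OCR errors.

```python
import re

pattern = r'verzocht([^.]+?)\.'

clean = 'u heeft verzocht om informatie over stikstof.'
typo = 'u heeft verzocbt om informatie over stikstof.'       # keyword misread
no_period = 'u heeft verzocht om informatie over stikstof,'  # period misread as a comma

# only the clean sentence yields a match
for text in (clean, typo, no_period):
    print(re.findall(pattern, text))
```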

In [3]:
# some of the dirs identify a wob request as 1, 2, 3 and some as 1.0, 2.0, 3.0
# this makes all of the dirs use the first form
def fixNamingSceme():
    base = 'F:\\Data files\\Master thesis\\verzoeken\\'
    for ministryDir in os.listdir(base):
        for folder in os.listdir(os.path.join(base, ministryDir)):
            oldPath = os.path.join(base, ministryDir, folder)
            newPath = os.path.join(base, ministryDir, folder.split('.')[0])

            # only rename directories whose name actually changes
            if os.path.isdir(oldPath) and oldPath != newPath:
                os.rename(oldPath, newPath)


In [10]:
# because the extractor uses periods as sentence delimiters, periods inside
# abbreviations like N.V.T. need to be removed
# this function converts it to N V T while keeping the periods at the end of sentences
def removeAbbreviation(text):
    sentence = ''

    # split the text into words on single spaces
    for word in text.split(' '):

        # keep words without a period as they are; words that contain a newline
        # are also left untouched, as their period may end a line of the raw text
        if word.count('.') == 0 or '\n' in word:
            sentence += word + ' '

        # a single period at the end of the word ends the sentence, so keep it
        elif word.count('.') == 1 and word[-1] == '.':
            sentence += word + ' '

        # any other period is part of an abbreviation, number or url:
        # replace every period in the word with a space
        else:
            sentence += word.replace('.', ' ') + ' '

    return sentence


In [5]:
def eval(text):

    # text = rawText.replace('\n', ' ')
    text = text.lower()
    text = removeAbbreviation(text)

    patterns = [r'verzocht([^.]+?)\.', r'u verzoekt([^.]+?)\.', r'om informatie over([^.]+?)\.', r'uw verzoek ziet([^.]+?)\.', r'om openbaarmaking van([^.]+?)\.']
    matches = []

    # get matches for all keywords
    for pattern in patterns:
        matches += re.findall(pattern, text)

    uniqueMatches = []

    # check all matches against each other
    for i in range(len(matches)):
        for j in range(len(matches)):
            if i == j:
                continue
            
            # skip a match that is contained in another match;
            # of identical matches only the first one is kept
            if matches[i] in matches[j] and (matches[i] != matches[j] or j < i):
                break

        # add match to uniqueMatches if the j loop completes
        else:
            uniqueMatches.append(matches[i])

    print(uniqueMatches)
    print(text)



In [6]:
# this function gets a random decision doc from a given ministry
def getRandomDecisionDoc(ministrie):

    # set paths 
    baseDFPath = '..\\data\\openstate data\\'
    basePDFPath = 'F:\\Data files\\Master thesis\\verzoeken\\'

    # get df of ministry
    for f in os.listdir(baseDFPath):
        if ministrie in f.lower():
            file = f
            break

    # get dir of ministry 
    for d in os.listdir(basePDFPath):
        if ministrie in d.lower():
            dir = basePDFPath + d + '\\'
            break
        
    # load in dataframe of ministry and get random sample
    testDf = pd.read_excel(baseDFPath + file)
    # testDf = testDf[['WOB Verzoek', 'Soort aanvraag', 'URL']]
    # testDf = testDf.dropna(subset=['Soort aanvraag'])
    s = testDf.sample(1)

    # get request number to find correct pdf
    requestNr = s['WOB Verzoek'].values[0]
    description = s['Soort aanvraag'].values[0]

    if not os.path.exists(dir + str(requestNr)):
        return getRandomDecisionDoc(ministrie)

    # find the decision document
    for p in os.listdir(dir + str(requestNr)):

        # if found, save it in pdfPath
        if 'besluit' in p.lower() and 'bijlage' not in p.lower():
            pdfPath = p
            break
    
    # if there is no decision document, try again
    else:
        return getRandomDecisionDoc(ministrie)

    return description, dir + str(requestNr) + '\\', pdfPath


In [12]:
def textExtract(dir, name):
    txtName = '.'.join(name.split('.')[0:-1])
    txtName = dir + txtName + '.txt'
    
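    # clean up leftover .txt files from previous runs, but keep the cached
    # text of this pdf (f is a bare file name, txtName is a full path)
    for f in os.listdir(dir):
        if f.endswith('.txt') and dir + f != txtName:
            os.remove(dir + f)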


    # check if text is already extracted
    if not os.path.exists(txtName):
        os.system(f'pdftotext -f 0 -l 10 -raw "{dir}{name}"')

    # open file and return content
    with open(txtName, 'r', encoding='utf8') as f:
        text = f.read()
    return text


In [14]:
# loop variable renamed so that it no longer shadows the builtin min
ministries = ['lnv','az','buza','bzk','ezk','fin','ienw','jenv','ocw','szw','vws']
for ministry in ministries:
    groundTruth, pdfPath, pdfName = getRandomDecisionDoc(ministry)
    print(groundTruth, pdfPath, pdfName)
    text = textExtract(pdfPath, pdfName)
    eval(text)

    x = input()
    if x == 'q':
        break
    else:
        clear_output()




Verzoek om 
openbaarmaking van documenten over de toepassing van de Wet Arbeid Vreemdelingen F:\Data files\Master thesis\verzoeken\WOB-verzoeken VWS\85\ Bes-Besluit+Wob-verzoek+over+toepassing+artikel+8+Wet+Arbeid+Vreemdelingen.pdf
[' om\ninformatie over de toepassing van de bevoegdheid op grond van artikel 8, derde\nlid, onder a, b, en c, van de wet arbeid vreemdelingen (wav)', ' wordt om bewijs te leveren dat u bevoegd\nbent tot het indienen van het bezwaar']
> retouradres postbus 20350 2500 ej den haag
datum: 15 november 2019
betreft: besluit op uw wob-verzoek
ministerie van volksgezondheid,
welzijn en sport
directie wetgeving en
juridische zaken
bezoekadres:
parnassusplein 5
2511 vx den haag
t 070 340 79 11
f 070 340 59 84
www.rijksoverheid.n1
inlichtingen bij
ons kenmerk
2019.123
1613829-198823-wjz
geachte
in uw brief van 18 juli 2019, door mij ontvangen op 21 augustus 2019, heeft u
met een beroep op de wet openbaarheid van bestuur (hierna: wob) verzocht om
informatie over de toep

KeyboardInterrupt: Interrupted by user

In [8]:
def getDataFrameOfMinistry(ministrie):

    baseDFPath = '..\\data\\openstate data\\'

    for f in os.listdir(baseDFPath):
        if ministrie in f.lower():
            file = f
            break

    return pd.read_excel(baseDFPath + file)



bzk = getDataFrameOfMinistry('bzk')
bzk.columns

Index(['WOB Verzoek', 'Onderwerp', 'Datum van binnenkomst',
       'Datum van antwoord', 'Aantal dagen \nin behandeling',
       'Binnen de \ntermijn afgehandeld',
       'Omvang document (aantal pagina's)\n', 'Volledig verstrekte documenten',
       'Deels verstrekte documenten', 'Niet verstrekte documenten',
       'Aantal overwogen \ndocumenten',
       'Aantal dagen nodig \ngehad per document', 'Soort aanvraag',
       'Bijzonderheden', 'URL', 'Kolom1'],
      dtype='object')

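date received:
- "in uw brief van" (in your letter of)
- "in uw mail van" (in your e-mail of)
- second date found
- "ontvangen op" (received on)
- third date

date processed:
- "datum" (date)
- first date found <== best option

inventory, number of documents:
- one or more digits followed by "documenten" (documents); take the first match


- "op basis van uw verzoek zijn er" (based on your request there are); present in most documents
- "documenten"; present in all documents
- under the heading "inventarisatie documenten" (inventory of documents)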
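
The date heuristics in these notes could be sketched as below. This is an untested idea rather than the notebook's method: it assumes dates are written as day, Dutch month name, and four-digit year, takes the date after "in uw brief van" or "in uw mail van" as the date received, and takes the first date found as the date processed. The names `MONTHS`, `DATE`, and `find_dates` are hypothetical; the sample lines are taken from the decision document printed above.

```python
import re

# day, Dutch month name, four-digit year, e.g. "18 juli 2019"
MONTHS = 'januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december'
DATE = rf'(\d{{1,2}} (?:{MONTHS}) \d{{4}})'

def find_dates(text):
    text = text.lower()
    # date received: the date that follows "in uw brief van" / "in uw mail van"
    received = re.search(r'in uw (?:brief|mail) van ' + DATE, text)
    # date processed: the first date found anywhere in the text
    all_dates = re.findall(DATE, text)
    return {
        'received': received.group(1) if received else None,
        'processed': all_dates[0] if all_dates else None,
    }

sample = ('datum: 15 november 2019\n'
          'in uw brief van 18 juli 2019, door mij ontvangen op 21 augustus 2019, heeft u')
print(find_dates(sample))
```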


F:\Data files\Master thesis\verzoeken\WOB-verzoeken VWS\85\ Bes-Besluit+Wobverzoek+over+toepassing+artikel+8+Wet+Arbeid+Vreemdelingen.pdf


