In [1]:
import pandas as pd
import numpy as np

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import Sentencizer


from IPython.display import clear_output
from datetime import datetime

import PyPDF2
import tabula

import random
import os
import re
import json
import difflib
import time


In [2]:
df = pd.read_csv('..\\data\\ocred\\files_df.csv', index_col = 0)
spacy.prefer_gpu()
nlp = spacy.load("nl_core_news_lg")

## Request metadata



Metadata about wob requests can be usefull to collect. Luckily, when the government agency to which the request was send completes the request, it also gives a decision document and a inventory list in addition to the documents that were requested. With these two documents the reason for a request, the date on which the request was received and completed, the number of documents concidered, the number of documents (partially) released, and the number of documents not released can be extracted.

# Reason for request
The reason for the wob request can be found in the decision document. In this document it states in one sentence a summary of what has been requested. This summary is what needs to be extracted. This summary is usually indicated by a keyword or keywords. These keywords come down to dutch versions of "requested", "information about", or "publication of". What follows is a list of all keywords used in dutch:
- verzocht
- u verzoekt
- om informatie over
- uw verzoek ziet
- om openbaarmaking van
- more to come
When one of these keywords are found, all following text is extracted until a the next period occurs. To do this, a regular expression was used. 

keyword([^.]+?)\\.

Where keyword is one of the words or phrases listed above. The expression first finds one of these keywords and then matches any alpha-numerical character until the first period is found. Before the regular expression can be used however, the text needs some preprocessing. First, exessive newlines are removed. Second, all letters are converted to lowercase letters. This is done so that the regular expression will match the keywords even though in the text the the keywords are written with capitalization. This has to be done as regular expressions are case sensitive. Last, for any word or abbreviation in the text that includes a period where the period does not indicate the end of a sentence, said period have to be removed otherwise it will trip up the regular expression. After this, the neccessary information can be extracted.

After the regular expressions have been run, the matches need to be checked for duplicates. This can happen when the decision document states for example "the request if for information about [...]". In this case the same sentence would be matched for the keyword "request" and "information about". This would match the same sentence twice so only one of these is needed and the other is removed.

To evaluate the extractor, a different method than the other extractors was used: the Intersection over Union (IoU). With this metric precision, recall, and F1 scores are still calculated, but with different method. It uses the overlap between the ground truth and the prediction to measure the performance of a model. First the intersect and union of the ground truth and prediction are calculated. The intersect is the overlap between the set of words in the extraction and the set of words in the ground truth. The union is the combination of the two. With this the IoU metric is calculated. Using the panoptic segmentation metric \cite{Mechea_2019}, IoU can then be compared to a threshold value. If the IoU is higher then the treshold it counts as a true positive, if it is lower it counts as both a false negative and a false positive. Presicion, recall and F1 score can then be calculated.

# Relevant dates
There are two relevant dates that can be extracted from the decision document: the date on which the request was received and the date on which the request was completed. The decision document has these two dates at the beginning. The completion date is same date as when the document was made so that is always the very first date in the document. The date of request is always in the first sentence of the document as they all start with as follows: "In your letter of 01 januari 2022". This means that the first and second date that are found in the document are the relevant dates to extract. Sometimes the date a request was send is not the same as the date the request was received. This happens when the request is send by post. In this case the decision document states: "in your letter of 01 januari 2022, received on 05 januari 2022". The following regular expression was used to check if this is the case:

'ontvangen op ([^.]+?)\,'

 In this case the first and third date are the relevant dates. To actually extract the dates, the dates extractor described here TODO[REF HERE] was used. These dates can then also be used to check how long the request took to fulfull and if it was done in the time that they have. 


# Number of documents
The inventory lists contain an overview of all documents that were found that fall in the scope of the request. The list also has information about which documents were made public, which were made partially public, which were not made public and also the documents that were already public. "Openbaar" or "volledig openbaar" for public, "deels openbaar" or "gedeeltelijk openbaar" for partially public, and "niet openbaar", "reeds openbaar", or "geweigerd" for not public. The sum of these can be used to find the total number of documents considered.


## Results
The request metadata extractors in \hyperref[tab:table7]{table 7} show very varying results. All the extractors that had to do with dates (Date received, date fulfilled, number of days taken, and completed in time) all show good performance in the F\textsubscript{1} score. However it is worth to mention that this is for the most part due to the recall for all of these being 1. This means that the extractor never found a match when there was no ground truth match to be found. The precision is a lot lower meaning that it wasn't correct all of the time. 

The extractors that had to do with the number of documents preformed really low. These are: Documents considered, days taken per document, Number of public documents, Number of partially public documents, and Number of not public documents. This is mostly because of the fact that the inventory lists are necessary to calculate these and of the 1045 requests in the dataset only 256 had an one. Besides that, the way the data was stored doesn't lend itself well to extraction. Its stored in tables within a PDF documents. There are methods to retrieve the tables with Python (like Tabula used in this thesis) however these methods are far from foolproof and don't always work. Even if the table is retrieved, there is no consistency between ministries on how to make these tables which adds another layer of complexity.

The reason for request extractor does show good results with a precision of 0.722 it correctly identified almost three quarters of request reasons. The recall is higher at 0.878. These figures give an F\textsubscript{1} score of 0.793


One problem the used approach is sensetive to is mistakes made in the OCR process. The regular expressions look some keywords and, when those are found, the end of the sentence. If in the OCR process a mistake was made in on of the keywords or if the period at the end of the sentence wasn't recognized as a period, the regular expression won't match it even though it should. 

Another problem is the generalizibility. These extractors were made specificly for desicion documents coming from wob requests to dutch government ministries. They will not work for wob request to provinces or muninipalities. They use different standards which do not work with the current extractors.

In [3]:
# some of the dirs id wob request as 1, 2,3 and some do it as 1.0, 2.0, 3.0
# this makes all of the dirs use the first method
def fixNamingSceme():
    base = 'F:\\Data files\\Master thesis\\verzoeken\\'
    for dir in os.listdir(base):
        for folder in os.listdir(base + dir):
            if os.path.isdir(base + dir + '\\' + folder):
                os.rename(base + dir + '\\' + folder, base + dir + '\\' + folder.split('.')[0])


### Cells for dates

In [4]:
# dates matcher
def getDates(text, nlp):
    months = ['januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli', 'augustus', 'september', 'oktober', 'november', 'december',
         'january', 'february', 'march', 'april', 'may', 'june', 'juli', 'august', 'september', 'october', 'november', 'december',
         'jan', 'feb', 'mrt', 'apr', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'okt']
    days = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag', 'zondag',
       'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']

    datesPattern = [ 
           {"IS_DIGIT": True}, 
           {"LOWER" : {"IN" : months}},
           {"IS_PUNCT" : True, "OP" : "?", "TEXT":'.'},
           {"IS_DIGIT": True}]
    matcher = Matcher(nlp.vocab)
    matcher.add("Dates", [datesPattern])
    
    text = re.sub(r'[^\w\s]', '', text)
    text = removeKenmerkDate(text)
    
    doc = nlp(text)
    matches = matcher(doc)

    return [doc[start:end].text for match_id, start, end in matches]


In [34]:
# converts dates to timestamp or yyyy-mm-dd
def convertDate(date, timestamp=False):
    months = {
        'januari':'01','jan':'01',
        'februari':'02','feb':'02',
        'maart':'03','mrt':'03',
        'april':'04','apr':'04',
        'mei':'05','mei':'05',
        'juni':'06','jun':'06',
        'juli':'07','jul':'07',
        'augustus':'08','aug':'08',
        'september':'09','sep':'09',
        'oktober':'10','okt':'10',
        'november':'11','nov':'11',
        'december':'12','dec':'12'
    }


    date = date.split(' ')
    date[1] = months[date[1].lower()]

    if timestamp:
        return pd.Timestamp(year=int(date[2]), month=int(date[1]), day=int(date[0]))
    if len(date[0]) == 1:
        date[0] = '0' + date[0]
    return date[2] + '-' + date[1] + '-' + date[0]

# calculate days between two dates
def days_between(d1, d2):
    d1 = convertDate(d1)
    d2 = convertDate(d2)
    try:
        d1 = datetime.strptime(d1, "%Y-%m-%d")
        d2 = datetime.strptime(d2, "%Y-%m-%d")
    except:
        return None
    return abs((d2 - d1).days)


In [32]:
# extracts all fields for dates
def dateInformation(text):        
    # find dates in text with date extractor
    matches = getDates(text, nlp)

    # first found date is date when document was written
    # the date the wob request was completed
    if len(matches) < 3:
        return None, None, None, None
    completedDate = matches[0]

    # check if request was received on a different date then when it was send
    # this is the case if it states "ontvangen op"
    receivedDate = re.findall('ontvangen op ([^.]+?)\,', text)
    if not receivedDate:
        receivedDate = matches[1]
    else:
        receivedDate = matches[2]

    # calculate days between request and completion
    daysTaken = days_between(completedDate, receivedDate)
    
    # converts to yyyy-mm-dd
    start = convertDate(receivedDate)
    end = convertDate(completedDate)
    print(start, end)
    
    # converts to np.datetime 
    try:
        start = np.datetime64(start, format='Y%-%m-%d')
        end = np.datetime64(end, format='Y%-%m-%d')
        businessDaysTaken = np.busday_count(start, end)
        inTime = businessDaysTaken <= 42
    except:
        inTime = None
    
    # check if request was fulfilled wihtin 42 business days

    return receivedDate, completedDate, daysTaken, inTime


### Cleaning cells

In [7]:
# because the extractor used periods as indicators, abbreviations like N.V.T. need to be removed
# this function converts it to N V T while keeping the periods at the end of sentences
def removeAbbreviation(text):
    sentence = ''

    # split sentence in words
    for word in text.split(' '):

        # if there is no period in the word, add it back to sentence
        if word.count('.') == 0 or '\n' in word:
            sentence += word + ' '
            continue
        
        # when there is one period in word
        elif word.count('.') == 1:

            # if it is at the end, keep it and add word to sentence
            if word[-1] == '.':
                sentence += word + ' '
            
            # if its in the middle replace it with a space
            else:
                word = word.replace('.', ' ')
                sentence += word + ' '
        
        # if there are  more than 1 periods, replace them all
        else:
            word = word.replace('.', ' ')
            sentence += word + ' '
            
    return sentence


In [8]:
# edge case when the sidebar of the doc is earlier than the heading
# in that case we have to remove a date that is not used
def removeKenmerkDate(text):
    text = text.split('\n')
    
    toRemove = None
    for i in range(len(text)):
        if '(kenmerk)' in text[i] and 'ons' not in text[i]:
            toRemove = [i, i+1]
            break
            
    if toRemove:
        text.remove(text[toRemove[0]])
        text.remove(text[toRemove[1]])

    text = '\n'.join(text)                        
    
    return text


### request reason

In [9]:
# use regex to find the request reason from decision doc
def getRequestReason(text, removeAbbr = True):

    # text = rawText.replace('\n', ' ')
    text = text.lower()
    if removeAbbr:
        text = removeAbbreviation(text)

    patterns = ['verzocht([^.]+?)\.', 
                'u verzoekt([^.]+?)\.', 
                'om informatie over([^.]+?)\.', 
                'uw verzoek ziet([^.]+?)\.', 
                'om openbaarmaking van([^.]+?)\.']
    matches = []

    # get matches for all keywords
    for pattern in patterns:
        matches += re.findall(pattern, text)

    uniqueMatches = []

    # check all matches against eachother
    for i in range(len(matches)):
        for j in range(len(matches)):
            if i == j:
                continue
            
            # if there is a duplicate we do not add it to uniqueMatches
            if matches[i] in matches[j]:
                break

        # add match to uniqueMatches if the j loop completes
        else:
            uniqueMatches.append(matches[i])

    return matches



### Retrieving text

In [10]:
# this function gets a random decision doc from a given ministry
def getRandomDecisionDoc(ministrie):

    # set paths 
    baseDFPath = '..\\data\\openstate data\\'
    basePDFPath = 'F:\\Data files\\Master thesis\\verzoeken\\'

    # get df of ministry
    for f in os.listdir(baseDFPath):
        if ministrie in f.lower():
            file = f
            break

    # get dir of ministry 
    for d in os.listdir(basePDFPath):
        if ministrie in d.lower():
            dir = basePDFPath + d + '\\'
            break
    
    requests = [x for x in os.listdir(dir) if x != '.DS_Store']
    requestNr = int(random.choice(requests))
    
    # load in dataframe of ministry and get random sample
    testDf = pd.read_excel(baseDFPath + file)
    testDf.columns = [x.replace('\n', '') for x in testDf.columns]    
    s = testDf[testDf['WOB Verzoek'] == requestNr]

    if not os.path.exists(dir + str(requestNr)):
        return None, None, None

    # find the desicion document
    for p in os.listdir(dir + str(requestNr)):

        # if found, save it in pdfPath
        if 'besluit' in p.lower() and 'bijlage' not in p.lower() and p.endswith('.pdf'):
            pdfPath = p
            break
    
    # if there is no desicion document, try again
    else:
        return None, None, None

    return s, dir + str(requestNr) + '\\', pdfPath


In [11]:
# extract text with pdftotext
def textExtract(dir, name, startPage = 0, endPage = 10):
    txtName = '.'.join(name.split('.')[0:-1])
    txtName = dir + txtName + f'-pages{startPage}-{endPage}' + '.txt'

    # extract text
    os.system(f'pdftotext -f {startPage} -l {endPage} -raw "{dir}{name}" "{txtName}"')
    
    if not os.path.exists(txtName): 
        return None
    
    # open file and return content
    with open(txtName, 'r', encoding='utf8') as f:
        text = f.read()
    return text



### Inventory document cells

In [12]:
def inventory(path, pdf):

    # words to look for
    rating = {'deels openbaar':0, 'niet openbaar':0, 'openbaar':0, 'reeds openbaar':0, 
            'geweigerd':0, 'gedeeltelijk openbaar':0,
            'volledig openbaar': 0}
    
    # if an inventory document exists, use that
    for file in os.listdir(path):
        if 'inventaris' in file.lower():
            rating = inventoryListToDataframe(path + file, rating)
            break
    
    # else try to find inventory table in decision doc
    if not rating:
        rating = inventoryListToDataframe(path + pdf, rating)

    if not rating:
        return None, None, None, None

    # combine categories
    notPublic = rating['niet openbaar'] + rating['geweigerd'] + rating['reeds openbaar']
    partialPublic = rating['deels openbaar'] + rating['gedeeltelijk openbaar']
    public = rating['openbaar'] - rating['niet openbaar'] - rating['reeds openbaar'] - partialPublic
    total = public + notPublic + partialPublic
    print(rating)
    if total == 0:
        return None, None, None, None
    
    return public, notPublic, partialPublic, public + notPublic + partialPublic

In [13]:
# makes dataframes from table in pdf
# then counts occurences of 
def inventoryListToDataframe(pdf, rating):
    try:
        tables = tabula.read_pdf(pdf, pages='all')


        if len(tables) == 0:
            return None

        for table in tables:
            for col in table.columns:
                col = list(table[col])
                col = [str(x).lower() for x in col]
                for key in rating:
                    for value in col:
                        if key in value:
                            rating[key] += 1
    except:
        return None

    return rating


### Others

In [14]:
# converts description of n documents to int
def getSumOfNumbersFromString(string):
    try:
        if type(string) == str:
            return sum([int(s) for s in string.split() if s.isdigit()])
        elif type(string) == float or type(string) == int:
            return int(string)
    except:
        return None
        

In [15]:
# find number of pages in pdf documents
def getNumberOfPages(path):
    nPages = 0

    # gets list of pdf files in a directory
    pdfs = [x for x in os.listdir(path) if x.endswith('.pdf')]
    
    # open all files and count pages
    for file in pdfs:
        try:
            with open(path + file, 'rb') as f:
                pdf = PyPDF2.PdfFileReader(f, strict=False)
                nPages += pdf.numPages
        except:
            return None
    return nPages


In [16]:
# finds the number of considered documents in the decision doc
def nDocs(text):

    # look for the first mention of a number of documents
    nDocuments = re.findall('[0-9]+? documenten[a-z]{2} aangetroffen', text)
    
    if not nDocuments:
        nDocuments = re.findall('[0-9]+? document[a-z]{2}', text)
    
    if not nDocuments:
        if len(re.findall('één document', text)) == 1:
            return 1

    # return the first if found, else return None
    if type(nDocuments) == list:
        if len(nDocuments) == 0:
            return None
        return int(nDocuments[0].split(' ')[0])
    elif type(nDocuments) == int:
        return nDocuments

    return None

### Performance calculations

In [17]:
def calculateIoU(a, b):
    a = set(a)
    b = set(b)

    intersection = 0

    for value in a:
        if value in b:
            intersection += 1

    union = len(a) + len(b) - intersection
    return intersection / union

In [18]:
def calcPerformanceV2(results):
    performance = {}
    for key in results:       
        precision = results[key]['tp'] / (results[key]['tp'] + results[key]['fp'])
        recall = results[key]['tp'] / (results[key]['tp'] + results[key]['fn'])
        if precision == 0 and recall == 0:
            f1 = 0
        else:
            f1 = 2 * ((recall * precision)/(recall + precision))
        performance[key] = {}
        performance[key]['precision'] = round(precision,3)
        performance[key]['recall'] = round(recall,3)
        performance[key]['f1'] = round(f1,3)
        print(key)
        print(results[key])
        print(performance[key])

    return performance

def calculatePerformanceFinal(version):
    with open('..\\data\\results\\openstate' + version +'.json', 'r') as f:
        results = json.load(f)
        p = calcPerformanceV2(results)
        
    with open('..\\data\\results\\openstate performance' + version +'json', 'w') as f:
        json.dump(p, f)

In [20]:
def evaluateFinal(version):
    mins = ['lnv','az','buza','bzk','ezk','fin','ienw','jenv','ocw','szw','vws']
    
    resultsDir = 'C:\\Users\\justin\\OneDrive - UvA\\Studie\\Data Science\\Thesis\\Knowledge extraction\\data\\results\\'
    
    # get results doc
    if 'openstate' + version +'.json' in os.listdir(resultsDir):
        with open(resultsDir + 'openstate' + version +'.json', 'r') as f:
            results = json.load(f)
            docsChecked = sum([results['received'][key] for key in results['received']])
    # if theres no results doc yet, create new one
    else:
        docsChecked = 0
        results = {}
        cats = ['reason','received', 'awnser', 'days taken', 'in time', 'docs considered', 'days per doc', 'n pages', 'public', 'not public', 'partial']
        for cat in cats:
            results[cat] = {'tp' : 0, 'fn' : 0, 'fp' : 0}
    

    while docsChecked < 200:

        # this selects a request with good text and retrieves the ground truth from the openstate files
        while True:
            ministry = random.choice(mins)
            clear_output()
            print(f'number of docs checked {docsChecked}')
            groundTruth, pdfPath, pdfName = getRandomDecisionDoc(ministry)
            
            if pdfPath == None:
                continue
            
            print(f'Path: {pdfPath}{pdfName}')
            print('\n______________________________')   
            
            # extract text from document
            text = textExtract(pdfPath, pdfName)
            if text == None:
                continue
            print(text[:2000])
            
            # check if the text actually contains the decision document
            goodText = input('Good text? ')
            if goodText.lower() != 'n':
                break
            if goodText == 'q':
                with open('..\\data\\results\\openstate' + version +'.json', 'w') as f:
                    json.dump(results, f)
                return results
            
        docsChecked += 1
        
        text = text.replace('\n', ' ')
        
            
        # extract metadata
        reason = getRequestReason(text)
        if len(reason) == 0:
            reason = ''
        else:
            match = reason[0]
        receivedDate, completedDate, daysTaken, inTime = dateInformation(text)       
        nDocuments = nDocs(text)
        nPages = getNumberOfPages(pdfPath)
        public, notPublic, partialPublic, total = inventory(pdfPath, pdfName)
        
        # tabula has a lot of output, this clears it
        clear_output()

        # convert dates to pd Timestamp dates to compare to ground truth      
        try:
            if receivedDate:
                receivedDate = convertDate(receivedDate, True)
        except:
            receivedDate = None
        try:
            if completedDate:
                completedDate = convertDate(completedDate, True)
        except:
            completedDate = None

        # if number of documents was not found in text, use total docs from inventory list
        if not nDocuments and total:
            nDocuments = total        

        # calculate days per doc
        if daysTaken and nDocuments:
            daysPerDoc = round(daysTaken / nDocuments, 2)
        elif total and daysTaken:
            daysPerDoc = round(daysTaken / total, 2)
        else:
            daysPerDoc = None
        
        # ground truth data
        desc = groundTruth['Soort aanvraag'].values[0]
        received = groundTruth['Datum van binnenkomst'].values[0]
        awnser = groundTruth['Datum van antwoord'].values[0]
        daysTakenGT = groundTruth['Aantal dagen in behandeling'].values[0]
        inTimeGT = groundTruth['Binnen de termijn afgehandeld'].values[0]
        nDocsConsidered = groundTruth['Aantal overwogen documenten'].values[0]
        daysPerDocGT = groundTruth['Aantal dagen nodig gehad per document'].values[0]
        nPagesGT = groundTruth["Omvang document (aantal pagina's)"].values[0]
        publicGT = groundTruth["Volledig verstrekte documenten"].values[0]
        notPublicGT = groundTruth["Niet verstrekte documenten"].values[0]
        partialPublicGT = groundTruth["Deels verstrekte documenten"].values[0]

        # extract n docs from ground truth values
        publicGT = getSumOfNumbersFromString(publicGT)
        notPublicGT = getSumOfNumbersFromString(notPublicGT)
        partialPublicGT = getSumOfNumbersFromString(partialPublicGT)
        
        if nDocsConsidered == 0:
            nDocsConsidered = None

        # get inTimeGT to correct true/false format
        if inTimeGT == 'Ja':
            inTimeGT = True
        elif inTimeGT == 'Nee':
            inTimeGT = False
        else:
            inTimeGT = None


        values = {
            'received' : [pd.Timestamp(received), receivedDate], 
            'awnser' : [pd.Timestamp(awnser), completedDate], 
            'days taken' : [daysTakenGT, daysTaken], 
            'in time' : [inTimeGT, inTime], 
            'docs considered' : [nDocsConsidered, nDocuments], 
            'days per doc' : [round(daysPerDocGT, 2), daysPerDoc], 
            'n pages' : [nPagesGT, nPages], 
            'public' : [publicGT, public], 
            'not public' : [notPublicGT, notPublic], 
            'partial' : [partialPublicGT, partialPublic]
        }
        
        for key in values:
            if type(values[key][0]) == float  or type(values[key][0]) == np.float64:
                if np.isnan(values[key][0]):
                    values[key][0] = None
        
        # compare all extracted data to ground truth values
        for cat in values:
            
            # if ground truth and extractor are equal but not both none, its tp else tn
            if values[cat][1] == values[cat][0]:
                if values[cat][1] != None and values[cat][0] != None:
                    results[cat]['tp'] += 1
            
            # if extractor didnt find anything, its missing
            elif values[cat][1] == None:
                results[cat]['fn'] += 1
                
            # if they do not equal, its spurious
            else:
                results[cat]['fp'] += 1
        

        desc = desc.replace('\n', ' ')
        match = match.replace('\n', ' ')
        print(results)

        IoU = calculateIoU(set(desc), set(match))
        if IoU > .5:
            results['reason']['tp'] += 1
        else:
            results['reason']['fp'] += 1
            results['reason']['fn'] += 1
        
        with open('..\\data\\results\\openstate' + version +'.json', 'w') as f:
                json.dump(results, f)
                
evaluateFinal('final')
calculatePerformanceFinal('final')
                

{'reason': {'tp': 178, 'fn': 21, 'fp': 21}, 'received': {'tp': 135, 'fn': 0, 'fp': 65}, 'awnser': {'tp': 135, 'fn': 0, 'fp': 65}, 'days taken': {'tp': 117, 'fn': 0, 'fp': 83}, 'in time': {'tp': 134, 'fn': 0, 'fp': 66}, 'docs considered': {'tp': 71, 'fn': 72, 'fp': 36}, 'days per doc': {'tp': 48, 'fn': 76, 'fp': 54}, 'n pages': {'tp': 179, 'fn': 0, 'fp': 21}, 'public': {'tp': 5, 'fn': 53, 'fp': 32}, 'not public': {'tp': 8, 'fn': 81, 'fp': 29}, 'partial': {'tp': 9, 'fn': 117, 'fp': 28}}
reason
{'tp': 179, 'fn': 21, 'fp': 21}
{'precision': 0.895, 'recall': 0.895, 'f1': 0.895}
received
{'tp': 135, 'fn': 0, 'fp': 65}
{'precision': 0.675, 'recall': 1.0, 'f1': 0.806}
awnser
{'tp': 135, 'fn': 0, 'fp': 65}
{'precision': 0.675, 'recall': 1.0, 'f1': 0.806}
days taken
{'tp': 117, 'fn': 0, 'fp': 83}
{'precision': 0.585, 'recall': 1.0, 'f1': 0.738}
in time
{'tp': 134, 'fn': 0, 'fp': 66}
{'precision': 0.67, 'recall': 1.0, 'f1': 0.802}
docs considered
{'tp': 71, 'fn': 72, 'fp': 36}
{'precision': 0.664

## Test generalizability

The following is to test the generalizability of the extractors
I've used wob requests from the munincipality of Amsterdam

In [35]:
fileName = r'C:\Users\justin\OneDrive - UvA\Studie\Data Science\Thesis\Knowledge extraction\data\amsterdam csv\amsterdam_files_df.csv'
pdfs = r'F:\Data files\Master thesis\pdfs amsterdam\\'
dfAmstedam = pd.read_csv(fileName)
dfAmstedam = dfAmstedam[['full_name', 'name', 'page', 'text']]

requests = os.listdir(pdfs)


for i in range(20):
    request = random.choice(requests)
    print(request)
    pdfPath = pdfs + request
    for file in os.listdir(pdfPath):
        if 'besluit' in file.lower():
            pdfName = pdfPath + '\\' + file
            break
    
    dfRequest = dfAmstedam[dfAmstedam['full_name'].str.contains(request)]

    dfRequest = dfRequest.sort_values(by=['page'])
    pages = list(dfRequest.head(10).text.values)
    pages = [str(x) for x in pages]
    text = ' '.join(pages) 

    reason = getRequestReason(text)
    if len(reason) == 0:
        reason = ''
    else:
        match = reason[0]
    receivedDate, completedDate, daysTaken, inTime = dateInformation(text)       
    nDocuments = nDocs(text)
    nPages = getNumberOfPages(pdfPath)
    public, notPublic, partialPublic, total = inventory(pdfPath, pdfName)
    
    # tabula has a lot of output, this clears it
    clear_output()

    # convert dates to pd Timestamp dates to compare to ground truth      
    try:
        if receivedDate:
            receivedDate = convertDate(receivedDate, True)
    except:
        receivedDate = None
    try:
        if completedDate:
            completedDate = convertDate(completedDate, True)
    except:
        completedDate = None

    # if number of documents was not found in text, use total docs from inventory list
    if not nDocuments and total:
        nDocuments = total        

    # calculate days per doc
    if daysTaken and nDocuments:
        daysPerDoc = round(daysTaken / nDocuments, 2)
    elif total and daysTaken:
        daysPerDoc = round(daysTaken / total, 2)
    else:
        daysPerDoc = None


    print('reason: ', reason)
    print('receivedDate: ', receivedDate)
    print('completedDate: ', completedDate)
    print('daysTaken: ', daysTaken)
    print('inTime: ', inTime)
    print('nDocuments: ', nDocuments)
    print('nPages: ', nPages)
    print('daysPerDoc: ', daysPerDoc)
    print('public: ', public)
    print('notPublic: ', notPublic)

    
    print('partialPublic: ', partialPublic)
    time.sleep(1)
    input()



reason:  
receivedDate:  None
completedDate:  None
daysTaken:  None
inTime:  None
nDocuments:  None
nPages:  None
daysPerDoc:  None
public:  None
notPublic:  None
partialPublic:  None
