## Problem statement

Dates in Wob documents come in a variety of formats, because different documents follow different standards. In this notebook I explore how effectively I can extract dates from Wob documents.


## Approach

To do this I have chosen three documents of each of the two main types. The first type is a decision document (besluit), essentially the government's response to the Wob request. The second is the actual data that the government released, which contains everything from emails and memos to internal government documents.

## Links to datasets
- [Besluit doc 1](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d6b6d7c3080a7e30e39e8be2451f35d8_besluit-op-uw-wob-verzoek-dd-16-december-2020-inzake-uitgezonderde-detailhandel-tijdens-lockdown.pdf.txt)
- [Besluit doc 2](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/dc034afbaede3d587451c7062fd857e7_besluit.pdf.txt)
- [Besluit doc 3](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/ec59d18227ae89899be3f69f57879a60_besluit-op-bezwaar-tegen-het-wob-besluit-over-amvs-in-griekenland-en-hun-huisvesting.pdf.txt)
- [Documenten doc 1](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d6b6d7c3080a7e30e39e8be2451f35d8_documenten-wob-detailhandel.pdf.txt)
- [Documenten doc 2](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d3581109c8c5876da8795f87eb424634_documenten-wob-verzoek-dd-11-juni-2021-inzake-uitgezonderde-detailhandel-tijdens-de-lockdown.pdf.txt)
- [Documenten doc 3](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/eb375aba46cb1fb46f895c04c1744906_stukken-wob-verzoek-hema-gelakt-muv-stukken-uitgestelde-verstrekking.pdf.txt)


I chose these documents because each is small enough to label all the dates by hand, while together they are hopefully still large enough to make an adequate test set.


For something to count as a date, it must contain at least a day and a month. This means "4 juli" counts as a date but "juli 2021" does not. In the [file with the dates](https://github.com/JustinBon/thesis/blob/main/data/dates.txt) one pattern is by far the most common: (optional) day of the week - day - month - (optional) year, where the dashes stand for spaces. Some examples are "Woensdag 1 april", "1 april", "14 april 2020" and "zaterdag 12 december 2020". Dates like these can easily be added to the spaCy pipeline.

The other types of dates look like 2020-12-15, 2/12/19, 18-Mar-2020 and 15-Jun. These are handled with a regex, because spaCy works with tokens: formats like these are treated as a single token, and the token patterns used in this notebook cannot describe the structure inside one token.
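To make the single-token case concrete, here is a minimal sketch of regexes for the four example formats above. The names `SINGLE_TOKEN_PATTERNS` and `find_single_token_dates` are hypothetical, and the month list assumes English three-letter abbreviations as in the examples; Dutch abbreviations such as "mrt" or "okt" would have to be added. It is not the extractor used further down in this notebook.

```python
import re

# Assumption: English three-letter month abbreviations, as in "18-Mar-2020".
MONTH_ABBR = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"

# One pattern per example format; the names are hypothetical.
SINGLE_TOKEN_PATTERNS = {
    # 2020-12-15
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # 2/12/19
    "slash": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    # 18-Mar-2020
    "day_mon_year": re.compile(rf"\b\d{{1,2}}-{MONTH_ABBR}-\d{{4}}\b", re.IGNORECASE),
    # 15-Jun (the negative lookahead avoids matching the start of 18-Mar-2020)
    "day_mon": re.compile(rf"\b\d{{1,2}}-{MONTH_ABBR}\b(?!-)", re.IGNORECASE),
}


def find_single_token_dates(text):
    """Return (format_name, matched_text) pairs for every pattern hit."""
    hits = []
    for name, pattern in SINGLE_TOKEN_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits
```

These patterns only check the shape of the string, so something like 2020-13-45 would still match; a validation step with `datetime.strptime` is still needed afterwards.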

In [88]:
import spacy
from spacy.matcher import Matcher
import os
import re
from datetime import datetime
spacy.prefer_gpu()

False

For this first part I use the spaCy Matcher, because most of the dates consist of multiple tokens. This means a single matcher pattern can be made to find all the dates that follow this pattern.

In [2]:
months = ['januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli', 'augustus', 'september', 'oktober', 'november', 'december']
days = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag', 'zondag']
sent = ['datum', 'verzonden', 'sent', 'date']

In [19]:
def findPattern(pattern):  
    base = r'..\\data\\covid wob text without ocr\\'
    nlp = spacy.load("nl_core_news_lg")
    matcher = Matcher(nlp.vocab)
    matcher.add("Dates", [pattern])

    with open('..\\data\\dates.txt', 'r', encoding='utf-8') as f:
        dates = f.read()
        dates = dates.split('\n\n')
    
    results = {}
    
    for file in dates:
        lines = file.split('\n')

        with open(base + lines[0] +'.txt', 'r', encoding='utf-8') as f:
            text = f.read()
            text = re.sub(' +', ' ', text)
        results[lines[0]] = []
        doc = nlp(text)
        matches = matcher(doc)

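        # Collect every match as (start token, end token, matched text).
        temp = []
        for match_id, start, end in matches:
            temp.append((start, end, doc[start:end].text))

        # Because of the optional tokens, the matcher also returns shorter
        # variants of the same date; try to keep only one match per date.
        for i in range(len(temp)):
            # The last match has no successor to compare with, so keep it.
            if i + 1 == len(temp):
                results[lines[0]].append(temp[i][2])
                break

            # The next match starts at the same token, so this one is
            # treated as a shorter variant of it: skip.
            if temp[i][0] == temp[i+1][0]:
                continue

            # Skip a match that is a proper substring of the match two
            # positions back. Note that for i < 2 the index i - 2 wraps
            # around to the end of the list.
            if temp[i][2] in temp[i - 2][2] and temp[i][2] != temp[i - 2][2]:
                continue

            results[lines[0]].append(temp[i][2])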
    return results
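
The overlap handling in `findPattern` relies on the order in which matches come back. As an alternative, not the method used in this notebook, the same idea can be sketched as "keep the longest match and drop everything that overlaps it". The helper name `keep_longest` is hypothetical, and the input is assumed to be a list of `(start, end, text)` tuples like the `temp` list above. spaCy's `spacy.util.filter_spans` offers similar behaviour for `Span` objects.

```python
def keep_longest(matches):
    """Keep only non-overlapping matches, preferring longer spans.

    `matches` is a list of (start, end, text) tuples with token offsets.
    """
    # Longest spans first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    taken = set()  # token indices already covered by a kept match
    kept = []
    for start, end, text in ordered:
        span = range(start, end)
        if any(i in taken for i in span):
            continue  # overlaps a longer match that was already kept
        taken.update(span)
        kept.append((start, end, text))
    # Return the survivors in document order.
    return sorted(kept)
```

For the four variants the matcher can return for "zaterdag 12 december 2020", only the full four-token match survives.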

In [17]:
def printResults(results):
    for k in results:
        print(k)
        for match in results[k]:
            print(match)

In [159]:
datesPattern = [{"LOWER" : {"IN" : days}, "OP" : "?"}, 
           {"IS_DIGIT": True}, 
           {"LOWER" : {"IN" : months}},
           {"IS_DIGIT": True, "OP" : "?"}]
s = findPattern(datesPattern)

In [21]:
# WORK IN PROGRESS
# This finds dates that are preceded by words like datum, verzonden or sent. These are then probably the dates on which
# the document was created or sent

emailPattern = [{"LOWER" : {"IN" : sent}},
           {"IS_PUNCT" : True, "OP" : "?"},
           {"LOWER" : {"IN" : days}, "OP" : "?"}, 
           {"IS_DIGIT": True}, 
           {"LOWER" : {"IN" : months}},
           {"IS_DIGIT": True, "OP" : "?"}]
printResults(findPattern(emailPattern))

d6b6d7c3080a7e30e39e8be2451f35d8_besluit-op-uw-wob-verzoek-dd-16-december-2020-inzake-uitgezonderde-detailhandel-tijdens-lockdown.pdf
d6b6d7c3080a7e30e39e8be2451f35d8_documenten-wob-detailhandel.pdf
Verzonden: zaterdag 12 december 2020
Verzonden: zaterdag 12 december 2020
Verzonden: maandag 14 december 2020
Verzonden: dinsdag 15 december 2020
Verzonden: dinsdag 15 december 2020
Datum 2 december 2020
Verzonden: woensdag 16 december 2020
d3581109c8c5876da8795f87eb424634_documenten-wob-verzoek-dd-11-juni-2021-inzake-uitgezonderde-detailhandel-tijdens-de-lockdown.pdf
Datum 31 december 2020
Datum 15 januari 2021
Datum:dinsdag 19 januari 2021
Verzonden: dinsdag 19 januari 2021
Verzonden: dinsdag 19 januari 2021
Verzonden: dinsdag 19 januari 2021
Verzonden: maandag 18 januari 2021
Verzonden: dinsdag 19 januari 2021
Verzonden: dinsdag 19 januari 2021
Verzonden: dinsdag 12 januari 2021
Datum: woensdag 6 januari 2021
Verzonden: dinsdag 22 december 2020
Verzonden: maandag 21 december 2020
Datum 2

This part uses regular expressions, for the dates that consist of only one token. So far I have identified the following date formats:

- dd/mm, dd-mm
- dd/mm/yy, dd/mm/yyyy, dd-mm-yy, dd-mm-yyyy

In [201]:
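def validate(dates, sep, pat):
    """Keep only the regex candidates that parse as real dates."""
    goodDates = []
    
    for date in dates:
        try:
            # The strptime patterns are space separated, so swap the separator
            date = date.replace(sep, ' ')
            datetime.strptime(date, pat)
            goodDates.append(date.replace(' ', sep))
        except ValueError:
            try:
                # Fall back to a two-digit year, which %Y does not accept
                if len(date.split(' ')) == 3 and len(date.split(' ')[2]) == 2:
                    datetime.strptime(date, '%d %m %y')
                    goodDates.append(date.replace(' ', sep))
            except ValueError:
                pass     
    return goodDates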
            

def regexMatcher():
    base = r'..\\data\\covid wob text without ocr\\'

    with open('..\\data\\dates.txt', 'r', encoding='utf-8') as f:
        dates = f.read()
        dates = dates.split('\n\n')
    
    results = {}
    
    for file in dates:
        lines = file.split('\n')

        with open(base + lines[0] +'.txt', 'r', encoding='utf-8') as f:
            text = f.read()
            text = re.sub(' +', ' ', text)
        results[lines[0]] = []
        
        # dd/mm
        results[lines[0]] += validate(re.findall(r'[0-3][0-9]/[0-1][0-9]', text), '/', '%d %m')
        
        # dd/mm/yy and dd/mm/yyyy
        results[lines[0]] += validate(re.findall(r'[0-3][0-9]/[0-1][0-9]/[0-9]{2,4}', text), '/', '%d %m %Y')
        
        # dd-mm
        results[lines[0]] += validate(re.findall(r'[0-3][0-9]-[0-1][0-9]', text), '-', '%d %m')
        
        # dd-mm-yy and dd-mm-yyyy
        results[lines[0]] += validate(re.findall(r'[0-3][0-9]-[0-1][0-9]-[0-9]{2,4}', text), '-', '%d %m %Y')
        
    return results

In [202]:
printResults(regexMatcher())


d6b6d7c3080a7e30e39e8be2451f35d8_besluit-op-uw-wob-verzoek-dd-16-december-2020-inzake-uitgezonderde-detailhandel-tijdens-lockdown.pdf
d6b6d7c3080a7e30e39e8be2451f35d8_documenten-wob-detailhandel.pdf
14/12
d3581109c8c5876da8795f87eb424634_documenten-wob-verzoek-dd-11-juni-2021-inzake-uitgezonderde-detailhandel-tijdens-de-lockdown.pdf
13/12
20/12
20/12/16
14-12
15-12
24-01
14-12
15-12
24-01
24-01
24-01
14-12
15-12
20-12
20-12
20-12
20-12-15
20-12-15
20-12-15
dc034afbaede3d587451c7062fd857e7_besluit.pdf
ec59d18227ae89899be3f69f57879a60_besluit-op-bezwaar-tegen-het-wob-besluit-over-amvs-in-griekenland-en-hun-huisvesting.pdf
eb375aba46cb1fb46f895c04c1744906_stukken-wob-verzoek-hema-gelakt-muv-stukken-uitgestelde-verstrekking.pdf
17/12
31/12
31/12
15/05
17/12/19
31/12/2019
31/12/2019
29-05
29-05-2020
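
Some of the matches above, such as 20-12 and 20-12-15, appear to be cut out of the middle of longer strings like 2020-12-15. A minimal sketch of one way to avoid this, assuming a date candidate should never be glued to other digits or dashes, is to add lookarounds to the pattern. The names and the sample sentence below are made up for illustration.

```python
import re

# The dd-mm pattern as used in regexMatcher above.
LOOSE_DAY_MONTH = re.compile(r"[0-3][0-9]-[0-1][0-9]")

# Assumption: reject a match that touches another digit or dash on either side.
STRICT_DAY_MONTH = re.compile(r"(?<![\d-])[0-3][0-9]-[0-1][0-9](?![\d-])")


def find_day_month(text, strict=True):
    """Return all dd-mm candidates, with or without the lookarounds."""
    pattern = STRICT_DAY_MONTH if strict else LOOSE_DAY_MONTH
    return pattern.findall(text)


sample = "Ontvangen 2020-12-15, overleg op 14-12 en 29-05-2020."
print(find_day_month(sample, strict=False))  # ['20-12', '14-12', '29-05']
print(find_day_month(sample, strict=True))   # ['14-12']
```

With the strict version "29-05" is no longer reported separately from "29-05-2020"; the full date would then have to come from the dd-mm-yyyy pattern, which would need the same kind of lookarounds.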


In [195]:
def counter(found, matches):
    for keys in matches:
        for value in matches[keys]:
            if value in found:
                found[value] += 1
            else:
                found[value] = 1
    
    return found

def showError(error, found, labeledDates):
    # Print how often each misclassified date was matched and how often it was labeled
    for err in error:
        f = found.get(err, 0)
        l = labeledDates.get(err, 0)
        print(f'date: {err}, matches: {f}, labeled: {l}')
    
    
    
def preformance(s):
    # s (argument) = spaCy matches
    # r = regex matches
    r = regexMatcher()
    
    found = {}
    found = counter(found, s)
    found = counter(found, r)
    
    with open('..\\data\\dates.txt', 'r', encoding='utf-8') as f:
        dates = f.read()
        dates = dates.split('\n\n')
    
    labeledDates = {}
    for file in dates:
        dates = file.split('\n')
        for date in dates[1:]:
            date = date.lower()
            if date in labeledDates:
                labeledDates[date] += 1
            else:
                labeledDates[date] = 1
    
    
    tp = 0
    fp = 0
    fn = 0
    fpList = []
    tpList = []
    fnList = []
    
    for date in found:
        
        # if the matcher found things that were not in the labeled set
        if date not in labeledDates:
            fp += found[date]
            fpList.append(date)
            
        # if matcher found the same items as the labeled set
        elif found[date] == labeledDates[date]:
            tp += found[date]
            
        # if the matcher found a date more times
        elif found[date] > labeledDates[date]:
            tp += labeledDates[date]
            fp += found[date] - labeledDates[date]
            fpList.append(date)
            
        # if the matcher found a date less times
        elif found[date] < labeledDates[date]:
            tp += found[date]
            fn += labeledDates[date] - found[date]
            fnList.append(date)
            
    # for the times where neither of the matcher found a date
    for date in labeledDates:
        if date not in found:
            fn += labeledDates[date]
            fnList.append(date)
            
    print('recall', tp / (tp + fn))
    print('precision', tp / (tp + fp))
    print('')
    
    print('False negatives')
    showError(fnList, found, labeledDates)
    print('')
    print('False positives')
    showError(fpList, found, labeledDates)

    return found, labeledDates
        
        
    

In [196]:
found, labeledDates = preformance(s)

recall 0.7619047619047619
precision 0.7017543859649122

False negatives
date: 16 december 2020, matches: 13, labeled: 14
date: 15 december 2020, matches: 5, labeled: 6
date: 17 januari 2021, matches: 3, labeled: 4
date: 15 juni, matches: 6, labeled: 7
date: 16 april, matches: 1, labeled: 2
date: woensdag 16 dec. 2020, matches: 0, labeled: 2
date: 19 jan. 2021, matches: 0, labeled: 3
date: dinsdag 19 januari, matches: 0, labeled: 1
date: 13 januari 2021, matches: 0, labeled: 1
date: 25 jan. 2021, matches: 0, labeled: 1
date: zaterdag 12 dec. 2020, matches: 0, labeled: 1
date: 2020-12-15, matches: 0, labeled: 3
date: 8 juni 2021, matches: 0, labeled: 1
date: 02 dec 2019, matches: 0, labeled: 1
date: woensdag 1 april, matches: 0, labeled: 1
date: dinsdag 31 maart, matches: 0, labeled: 1
date: 28/3/20, matches: 0, labeled: 1
date: 23/3/20, matches: 0, labeled: 1
date: 9/3/20, matches: 0, labeled: 1
date: 2/12/19, matches: 0, labeled: 1
date: 18-mar-2020, matches: 0, labeled: 2
date: 15-mar

## Conclusion

Dates can be extracted relatively easily. This experiment resulted in a precision of 0.702 and a recall of 0.762. However, after looking at the results I think the extractor performs better than that, because I may have missed some dates during labeling that the extractor did find. For example, the date "dinsdag 19 januari 2021" was found 10 times, but it occurs only 8 times in my test set. This can mean either that the extractor misidentified it or that I missed it when making the test set. Since it is clearly a date, I think that I missed it. This would mean that, at the very least, the number of false positives is lower than reported and the precision therefore higher. If I missed dates in the test set, this could also mean that the true recall is a bit lower than reported.
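The counting behind these two numbers can also be written compactly with `collections.Counter`. This is a minimal sketch, assuming the matches and the labels are available as flat lists of strings; `precision_recall` is a hypothetical helper and not part of the notebook, although it should produce the same counts as the loop in `preformance`.

```python
from collections import Counter


def precision_recall(found, labeled):
    """Multiset precision and recall for two lists of date strings."""
    found_counts = Counter(found)
    labeled_counts = Counter(labeled)
    # Counter & Counter keeps the minimum count per key: the true positives.
    tp = sum((found_counts & labeled_counts).values())
    # Counter - Counter keeps only the positive surplus per key.
    fp = sum((found_counts - labeled_counts).values())
    fn = sum((labeled_counts - found_counts).values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In this scheme a date that is found 10 times but labeled 8 times contributes 8 true positives and 2 false positives, which is exactly the "dinsdag 19 januari 2021" situation described above.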

I also looked at why some strings that are clearly dates weren't found by the matcher, and the problem there again seems to be my labeling. For example, I manually labeled 6 occurrences of "15 december 2020" and 3 of "dinsdag 15 december 2020", while the matchers found 5 of "15 december 2020" and 4 of "dinsdag 15 december 2020". After looking back at the data, it seems that the matcher was right in this case. The same goes for more of the dates: most of the dates that show up as both false positives and false negatives fall into this category.
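One way to make the comparison less sensitive to this kind of labeling difference is to normalize both the matches and the labels before counting. The sketch below is an assumption on my part rather than something the evaluation above does: the hypothetical helper `normalize_date` lowercases the text, collapses whitespace and drops a leading weekday, so that "dinsdag 15 december 2020" and "15 december 2020" count as the same date.

```python
DAYS = ("maandag", "dinsdag", "woensdag", "donderdag", "vrijdag", "zaterdag", "zondag")


def normalize_date(text):
    """Lowercase, collapse whitespace and drop a leading weekday name."""
    tokens = text.lower().split()
    if tokens and tokens[0] in DAYS:
        tokens = tokens[1:]
    return " ".join(tokens)
```

The trade-off is that the evaluation then no longer checks whether the weekday itself was captured, so it answers a slightly different question than the original test.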

There are definitely some dates that weren't found but should have been. Most of these are in an unusual format or have extra spaces between the parts: "woensdag 27/5" and "3 -nov-2019" are examples of this. There are also false positives that really are false positives. Most of these are in the "31/12" format: they could all be actual dates, but they could also be two unrelated numbers separated by a forward slash.

In my opinion, this date matcher is more accurate than the test shows, mostly because of labeling mistakes on my part. I think that it is accurate enough to be used.

One thing that can be very useful is to check what precedes the date. If it is something like "datum" or "verzonden", or the English equivalents of those, then you can be relatively sure that it is the date on which the document was created or approved for release. I've already done something like that in a cell above, the one marked WORK IN PROGRESS.
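As a rough illustration of this idea outside spaCy, the sketch below uses a plain regex to pick up dates that directly follow one of the keywords and turns them into `datetime.date` objects. The names `SENT_DATE` and `document_dates` are hypothetical, and unlike the matcher pattern above it assumes the year is always present, because a full date object cannot be built without it.

```python
import re
from datetime import date

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]
DAYS = ["maandag", "dinsdag", "woensdag", "donderdag", "vrijdag", "zaterdag", "zondag"]
KEYWORDS = ["datum", "verzonden", "sent", "date"]

KEYWORD_ALT = "|".join(KEYWORDS)
DAY_ALT = "|".join(DAYS)
MONTH_ALT = "|".join(MONTHS)

# keyword, optional colon, optional weekday, day, month, four-digit year
SENT_DATE = re.compile(
    rf"\b(?:{KEYWORD_ALT})\b\s*:?\s*"
    rf"(?:(?:{DAY_ALT})\s+)?"
    rf"(\d{{1,2}})\s+({MONTH_ALT})\s+(\d{{4}})\b",
    re.IGNORECASE,
)


def document_dates(text):
    """Return datetime.date objects for dates that follow a keyword."""
    results = []
    for day, month, year in SENT_DATE.findall(text):
        try:
            results.append(date(int(year), MONTHS.index(month.lower()) + 1, int(day)))
        except ValueError:
            # Right shape but not a real date, e.g. "31 februari 2021".
            continue
    return results
```

A date without a keyword in front of it, such as "overleg op 4 juli 2021", is ignored by this sketch on purpose.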


One thing that hinders the extraction of dates is the messiness of the data. The data contains things like "soon as Feb- 27Supply for" for which the date should be Feb-24. To extract all of these kinds of dates, a lot of specialized extractors would need to be made. You would also first need to know which cases to specialize for, which means manually scanning the data, and that is exactly what we're trying to avoid. Another example is "1 mei 20 20", which would also need a special case. These problems, however, are all the result of poor scanning of the documents, where an extra space is added or a space is removed. This might be (mostly) resolved with the optimal OCR settings.
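To show what such a special case looks like, here is a minimal sketch that repairs only the "1 mei 20 20" kind of damage before matching. It assumes that the year was split into two two-digit halves by a single extra space, that it directly follows "day month", and that the century is 19 or 20; the names `SPLIT_YEAR` and `rejoin_split_year` are hypothetical.

```python
import re

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]
MONTH_ALT = "|".join(MONTHS)

# "<day> <month> 20 20" -> groups: "<day> <month> ", "20", "20"
SPLIT_YEAR = re.compile(
    rf"\b(\d{{1,2}} (?:{MONTH_ALT}) )(19|20) (\d{{2}})\b",
    re.IGNORECASE,
)


def rejoin_split_year(text):
    """Glue a year back together when it was split as '20 20'."""
    return SPLIT_YEAR.sub(r"\1\2\3", text)
```

This also illustrates the problem described above: the rule fixes one specific kind of damage, and it can misfire on text like "1 mei 20 15 stuks", where the second number is not part of the year.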