## Problem statement

Dates in the Wob documents come in a variety of formats. Different documents follow different standards, so in this notebook I explore how effectively dates can be extracted from Wob documents.


## Approach

To do this I have chosen three documents for each of the two main types of documents. The first type is the decision document (besluit), essentially the government's response to the Wob request. The second type is the actual data that the government released, containing everything from emails and memos to internal government documents.

## Links to datasets
- [Besluit doc 1](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d6b6d7c3080a7e30e39e8be2451f35d8_besluit-op-uw-wob-verzoek-dd-16-december-2020-inzake-uitgezonderde-detailhandel-tijdens-lockdown.pdf.txt)
- [Besluit doc 2](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/dc034afbaede3d587451c7062fd857e7_besluit.pdf.txt)
- [Besluit doc 3](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/ec59d18227ae89899be3f69f57879a60_besluit-op-bezwaar-tegen-het-wob-besluit-over-amvs-in-griekenland-en-hun-huisvesting.pdf.txt)
- [Documenten doc 1](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d6b6d7c3080a7e30e39e8be2451f35d8_documenten-wob-detailhandel.pdf.txt)
- [Documenten doc 2](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/d3581109c8c5876da8795f87eb424634_documenten-wob-verzoek-dd-11-juni-2021-inzake-uitgezonderde-detailhandel-tijdens-de-lockdown.pdf.txt)
- [Documenten doc 3](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/eb375aba46cb1fb46f895c04c1744906_stukken-wob-verzoek-hema-gelakt-muv-stukken-uitgestelde-verstrekking.pdf.txt)


I chose these documents because they are small enough to label all the dates manually while hopefully being large enough to still make an adequate test set.


For something to count as a date, it should have at least a day and a month. This means 4 juli counts as a date but juli 2021 does not. In the [file with the dates](https://github.com/JustinBon/thesis/blob/main/data/dates.txt) one pattern is by far the most common: (optional) day of the week - day - month - (optional) year, where the dashes represent spaces. Some examples are Woensdag 1 april, 1 april, 14 april 2020, and zaterdag 12 december 2020. These dates can easily be matched with a token pattern in the spaCy pipeline. The other types of dates look like 2020-12-15, 2/12/19, 18-Mar-2020 and 15-Jun. These need to be handled with a regex: the spaCy Matcher works on sequences of tokens, and each of these formats is treated as a single token, so a multi-token pattern cannot describe it.
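As a rough illustration of the regex route for the single-token formats, the sketch below covers the four example shapes mentioned above. The name `find_numeric_dates` and the list of English month abbreviations are assumptions made for this example; the real data may need Dutch abbreviations or other separators as well.

```python
import re

# Abbreviated month names as seen in the examples (English style).
# This list is an assumption and may need Dutch abbreviations too.
MONTH_ABBR = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

NUMERIC_DATE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{1,2}-\d{1,2}"                      # 2020-12-15
    r"|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"             # 2/12/19
    rf"|\d{{1,2}}-{MONTH_ABBR}(?:-\d{{2,4}})?"    # 18-Mar-2020, 15-Jun
    r")\b"
)


def find_numeric_dates(text):
    """Return all substrings of text that look like a single-token date."""
    return [m.group(0) for m in NUMERIC_DATE.finditer(text)]


sample = "Sent 2020-12-15, due 2/12/19, signed 18-Mar-2020, review on 15-Jun."
print(find_numeric_dates(sample))
```

Because the regex runs on the raw text, it can be applied alongside the token-based matcher without depending on how the tokenizer splits these strings.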

In [9]:
import spacy
from spacy.matcher import Matcher
import os
import re
spacy.prefer_gpu()
nlp = spacy.load("nl_core_news_lg")
matcher = Matcher(nlp.vocab)

In [11]:
months = ['januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli', 'augustus', 'september', 'oktober', 'november', 'december']
days = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag', 'zondag']

In [24]:
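# Folder with the extracted text files; the trailing '' keeps a path separator at the end.
base = os.path.join('..', 'data', 'covid wob text without ocr', '')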

pattern = [{"LOWER" : {"IN" : days}, "OP" : "?"}, 
           {"IS_DIGIT": True}, 
           {"LOWER" : {"IN" : months}},
           {"IS_DIGIT": True, "OP" : "?"}]

matcher.add("Dates", [pattern])

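# dates.txt holds one block per document, separated by blank lines;
# the first line of each block is the file name without the .txt extension.
with open(os.path.join('..', 'data', 'dates.txt'), 'r', encoding='utf-8') as f:
    dates = f.read().split('\n\n')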
    
for file in dates:
    lines = file.split('\n')
    
    with open(base + lines[0] +'.txt', 'r', encoding='utf-8') as f:
        text = f.read()
        text = re.sub(' +', ' ', text)
    
    doc = nlp(text)
    matches = matcher(doc)
    print(lines[0])
    
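    # Collect (start, end, text) for every match.
    temp = []
    for match_id, start, end in matches:
        temp.append((start, end, doc[start:end].text))

    # Skip a match when the next match shares its start or end token.
    # This removes some, but not all, overlapping duplicates.
    for i in range(len(temp)):
        if i + 1 < len(temp) and (temp[i][0] == temp[i+1][0] or temp[i][1] == temp[i+1][1]):
            continue
        print(temp[i][2])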
    

d6b6d7c3080a7e30e39e8be2451f35d8_besluit-op-uw-wob-verzoek-dd-16-december-2020-inzake-uitgezonderde-detailhandel-tijdens-lockdown.pdf
16 december 2020
16 december 2020
21 december 2020
15 december 2020
16 december 2020
13 januari 2020
18 januari 2020
21 januari
28 april 2021
11 februari 2021
d6b6d7c3080a7e30e39e8be2451f35d8_documenten-wob-detailhandel.pdf
zaterdag 12 december 2020
12 december 2020
zaterdag 12 december 2020
12 december 2020
maandag 14 december 2020
14 december 2020
16 december 2020
17 januari 2021
16 december 2020
17 januari 2021
16 december 2020
17 januari
20 januari 2021
18 januari 2021
8 december 2020
15 december 2020
19 januari 2021
27 november 2020
18 januari 2021
12 januari 2021
dinsdag 15 december 2020
15 december 2020
dinsdag 19 januari 2021
19 januari 2021
woensdag 16 december 2020
16 december 2020
zondag 17 januari 2021
17 januari 2021
15 december 2020
19 januari 2021
16 december 2020
19 januari 2021
17 januari 2021
8 december 2020
dinsdag 15 december 2020
15 
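
The output above lists several dates twice, once with and once without the day of the week, because the optional tokens in the pattern produce nested matches. One possible way to handle this, sketched below, is to keep only the longest span and drop every span that overlaps it. The function name `keep_longest` and the sample offsets are made up for illustration; this is not what the cell above does. spaCy v3 also offers `spacy.util.filter_spans` and the `greedy="LONGEST"` argument of `Matcher.add`, which should have a similar effect.

```python
def keep_longest(spans):
    """Keep the longest spans and drop any span that overlaps a kept one.

    `spans` is a list of (start, end, text) tuples with token offsets,
    the same shape as the `temp` list in the matching cell.
    """
    # Longest first; ties are broken by the earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, text in ordered:
        if any(i in taken for i in range(start, end)):
            continue
        kept.append((start, end, text))
        taken.update(range(start, end))
    return sorted(kept)  # back to document order


# Hypothetical matcher output for "zaterdag 12 december 2020 ... 17 januari".
matches = [
    (0, 3, "zaterdag 12 december"),
    (0, 4, "zaterdag 12 december 2020"),
    (1, 3, "12 december"),
    (1, 4, "12 december 2020"),
    (9, 11, "17 januari"),
]
for _, _, text in keep_longest(matches):
    print(text)
```

The same date appearing at two different positions in a document is kept twice, since those spans do not overlap.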

## Conclusion

One thing that can be very useful is to check what precedes the date. If it is something like "datum" or "verzonden", or the English equivalents of those, then you can be relatively sure that it is the date on which the document was created or approved for release.
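
A minimal sketch of this idea, assuming the keyword sits directly before the date (optionally followed by a colon). The keyword list and the name `find_labelled_dates` are assumptions for this example and were not derived from the labelled data.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")
DAYS = "maandag|dinsdag|woensdag|donderdag|vrijdag|zaterdag|zondag"

# Keywords that hint at a creation or send date; this list is an assumption.
KEYWORDS = "datum|verzonden|date|sent"

LABELLED_DATE = re.compile(
    rf"\b(?P<label>{KEYWORDS})\b\s*:?\s*"
    rf"(?P<date>(?:(?:{DAYS}) )?\d{{1,2}} (?:{MONTHS})(?: \d{{4}})?)",
    re.IGNORECASE,
)


def find_labelled_dates(text):
    """Return (label, date) pairs where a keyword directly precedes a date."""
    return [(m.group("label").lower(), m.group("date"))
            for m in LABELLED_DATE.finditer(text)]


sample = (
    "Datum: 16 december 2020\n"
    "Verzonden: maandag 14 december 2020\n"
    "Op 21 januari volgt meer."
)
print(find_labelled_dates(sample))
```

Dates without a preceding keyword, such as the one in the last line of the sample, are ignored by this pattern and would still be found by the general matcher.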


One thing that hinders the extraction of dates is the messiness of the data. The data contains things like "soon as Feb- 27Supply for" for which the date should be Feb-24. Extracting all dates of this kind would require many specialized extractors. You would also first need to know what to specialize the extractors for, which means manually scanning the data, and that is exactly what we are trying to avoid. Another example is "1 mei 20 20", which would also need a special case. These problems, however, all result from poor scanning of the documents, where an extra space is added or a space is removed. This might be (mostly) resolved with the optimal OCR settings.
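
To show what one such special case could look like, the sketch below repairs a four-digit year that was split by stray spaces, as in "1 mei 20 20". The name `repair_split_years` is hypothetical, and the pattern is only an assumption about how this particular error appears; it does not handle cases like "Feb- 27Supply" and could produce false positives when a two-digit number happens to follow a short year.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# Day, month name, then a four-digit year whose digits may be separated
# by stray spaces (for example "20 20").
SPLIT_YEAR_DATE = re.compile(
    rf"\b(\d{{1,2}}) ?({MONTHS}) ?((?:\d ?){{3}}\d)\b",
    re.IGNORECASE,
)


def repair_split_years(text):
    """Rewrite dates such as '1 mei 20 20' to '1 mei 2020'."""
    def fix(match):
        day, month, year = match.groups()
        return f"{day} {month} {year.replace(' ', '')}"
    return SPLIT_YEAR_DATE.sub(fix, text)


print(repair_split_years("Besluit van 1 mei 20 20 en 14 april 2020."))
```

Running such a repair step before the matcher keeps the matcher pattern itself simple, at the cost of one extra rule per type of scanning error.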