## Problem statement

For this extractor, I want to retrieve all mentions of ministries in the WOB documents. There are many different ministries to extract, each with its own abbreviation, so in this notebook I explore to what extent it is possible to extract them.

## Approach

I decided to use a different approach from the one I used for the dates extractor. For the dates I mostly used a rule-based system. Here I use gazetteers, because there are not many ministries to extract. Writing them down does not take long, and it will likely give better results than a rule-based system. A rule-based system might have to sacrifice accuracy in one area to get better results in others, or it would need a special case for every ministry, at which point you are basically using gazetteers.
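To make the idea concrete, here is a minimal sketch of gazetteer matching. The three names in `GAZETTEER`, the helper `find_ministries`, and the sample sentence are made up for illustration; the real list used in this notebook comes from Wikipedia.

```python
import re

# Hypothetical mini-gazetteer: a fixed list of (lowercased) ministry names.
GAZETTEER = [
    "financiën",
    "defensie",
    "volksgezondheid, welzijn en sport",
]

def find_ministries(text):
    """Count occurrences of 'ministerie van <name>' for every gazetteer entry."""
    text = text.lower()
    counts = {}
    for name in GAZETTEER:
        # re.escape makes sure the name is matched literally
        hits = re.findall("ministerie van " + re.escape(name), text)
        if hits:
            counts[name] = len(hits)
    return counts

sample = "Het ministerie van Financiën en het Ministerie van Defensie zijn betrokken."
print(find_ministries(sample))  # {'financiën': 1, 'defensie': 1}
```

No rules about context are needed: a name is either in the list or it is not, which is why this works well for a small, closed set such as ministries.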

### The documents
- [Document 1](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/0068ed0b40cca6270f857d2614cc63c0_besluit.pdf.text)
- [Document 2](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/0068ed0b40cca6270f857d2614cc63c0_document.pdf.text)
- [Document 3](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/0335b3f498dbbd7c537ad23abe8c08dc_deelbesluit-1-wob-verzoek-dd-11-augustus-2021-inzake-het-europees-herstelfonds.pdf.text)
- [Document 4](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/07e2b274045cb5b4f54371a3c905cae9_wobverzoek-mccb-catshuis.pdf.text)
- [Document 5](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/17967f10340f6de2a79ba984209b4a2c_besluit.pdf.text)
- [Document 6](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/2c4079ccaad78e8d2cb494681ae79928_stukken-bij-besluit-wob-verzoek-notities-besluitvorming-coronacrisis-20.pdf.text)
- [Document 7](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/3ee644b02189ec45794dd998ac5da5b5_besluit.pdf.text)
- [Document 8](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/40f5564f839324b9af20c295dd261007_inventarislijst-eerste-deelbesluit.pdf.text)
- [Document 9](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/41fedde7ea03bdd9ac5cd92c7f7cfd43_documenten.pdf.text)
- [Document 10](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/439285231fa523a74c430da4a2704fab_deels-openbare-documenten.pdf.text)
- [Document 11](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/4991dc6aed6e1369f58e6bc1996d5e2d_paginas-van-samengevoegd-document-850-paginas-met-uitgestelde-verstrekking-deel-1.pdf.text)
- [Document 12](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/4991dc6aed6e1369f58e6bc1996d5e2d_paginas-van-samengevoegd-document-850-paginas-met-uitgestelde-verstrekking-deel-2.pdf.text)
- [Document 13](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/4991dc6aed6e1369f58e6bc1996d5e2d_paginas-van-samengevoegd-document-850-paginas-met-uitgestelde-verstrekking-deel-3.pdf.text)
- [Document 14](https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/4991dc6aed6e1369f58e6bc1996d5e2d_paginas-van-samengevoegd-document-850-paginas-met-uitgestelde-verstrekking-deel-4.pdf.text)

Some of the documents did not contain a single mention of a ministry, so I excluded those. I went through documents until I had enough examples. The results can be found in [this text file](https://github.com/JustinBon/thesis/blob/main/data/ministeries.txt).
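The loading code in this notebook implies the layout of that file: blocks separated by a blank line, with the document name on the first line and one labeled mention on each following line. A small sketch of parsing that layout, using made-up contents (the document names and labels in `raw` are hypothetical):

```python
from collections import Counter

# Hypothetical contents in the assumed layout:
# first line of a block = document name, remaining lines = one labeled mention each.
raw = """doc_a.pdf
Financiën
VWS
financiën

doc_b.pdf
Defensie"""

labeled = {}
for block in raw.split("\n\n"):
    lines = block.split("\n")
    # lowercase the labels so 'Financiën' and 'financiën' are counted together
    labeled[lines[0]] = Counter(line.lower() for line in lines[1:])

print(labeled["doc_a.pdf"]["financiën"])  # 2
print(labeled["doc_b.pdf"]["defensie"])   # 1
```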

In [6]:
import spacy
from spacy.matcher import Matcher
import os
import re
from datetime import datetime
import requests
from bs4 import BeautifulSoup
spacy.prefer_gpu()

False

In [7]:
baselink = '(https://github.com/JustinBon/thesis/blob/main/data/covid%20wob%20text%20without%20ocr/'

with open('..\\data\\ministeries.txt', 'r', encoding='utf-8') as f:
    m = f.read()
    m = m.split('\n\n')
    
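# count the documents and the total number of labeled mentions
counter = 0
nfile = 0
for file in m:
    nfile += 1
    file = file.split('\n')
    # uncomment to print the markdown links listed under "The documents"
#     print(f'- [Document {nfile}]' + baselink + file[0] + '.text)')
    # the first line of each block is the file name, the rest are labels
    counter += len(file) -1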

    


To get a list of all the ministries, I took [the list that Wikipedia provides](https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_ministeries). It lists all ministries of the current government as well as the ministries that no longer exist, which is very convenient. Wikipedia also links to the individual wiki pages of the ministries, so I can immediately link to them too. I used BeautifulSoup to get this data.



CORRECTION:

After testing, the matcher turned out to be much more accurate when it only looks at the current ministries.

In [94]:
# gets labeled data
# also used for getting names of documents used

def getData():
    with open('..\\data\\ministeries.txt', 'r', encoding='utf-8') as f:
        m = f.read()
        m = m.split('\n\n')
    return m

In [177]:
# get list of ministeries from wikipedia

def getMinisteries():
    page = requests.get('https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_ministeries')
    soup = BeautifulSoup(page.content, "html.parser")
    # everything below is read from the last table cell on the page
    results = soup.find_all("td")[-1]

    wikis = {}

    # abbreviation per ministerie: the text inside the last pair of parentheses on each line
    abrr = []
    for item in str(results.find_all('p')[0]).split('\n')[:-1] + str(results.find_all('p')[1]).split('\n')[1:-1]:
        temp = re.findall(r'(?<=\()(.*?)(?=\))', item)
        if temp == []:
            abrr.append(None)
        elif temp[-1] == 'Nederland':
            abrr.append(None)
            if 'Overzeese Gebiedsdelen' in item:
                abrr.append(None)
        else:
            abrr.append(temp[-1].replace('&amp;', '&'))


    counter = 0
    for ministerie in results.find_all('a')[:12]:
        wikis[ministerie.text] = {'Link': 'https://nl.wikipedia.org' + ministerie['href'], 'Abbriviation' : abrr[counter]}
        counter += 1

    return wikis


In [129]:
# load hand-labeled data

def getLabeledData():
    m = getData()
        
    labeledMinisteries = {}
    for file in m:
        lines = file.split('\n')
        
        labeledMinisteries[lines[0]] = {}
        
        for line in lines[1:]:
            line = line.lower()
            if line in labeledMinisteries[lines[0]]:
                labeledMinisteries[lines[0]][line] += 1
            else:
                labeledMinisteries[lines[0]][line] = 1


    return labeledMinisteries


In [180]:
# find ministeries in the text

def findMinisteries():
    ministeries = getMinisteries()
    
    abrr = [ministeries[x]['Abbriviation'] for x in ministeries if ministeries[x]['Abbriviation'] is not None]
    
    allMinisteries = list(ministeries.keys()) + abrr
    
    
    found = {}
    
    for file in getData():
        
        fileName = file.split('\n')[0]
        found[fileName] = {}
        with open('..\\data\\covid wob text without ocr\\' + fileName +'.txt', 'r', encoding='utf-8') as f:
            text = f.read()
            text = re.sub(' +', ' ', text)
            text = text.lower()
        
        for ministerie in allMinisteries:
            # escape the name so characters such as '&' or '.' are matched literally
            temp = re.findall('ministerie van ' + re.escape(ministerie.lower()), text)
            if temp != []:
                found[fileName][ministerie.lower()] = len(temp)
    return found

In [166]:
# show for which ministeries the matcher made mistakes
# not used currently

def showError(error, found, labeled):
    
    for err in error:
        f = 0
        l = 0
        for file in found:
            # a ministerie that is missing from a file counts as zero
            f += found[file].get(err, 0)
            l += labeled.get(file, {}).get(err, 0)
        
        print(f'ministerie: {err}, matches: {f}, labeled: {l}')

In [181]:
# calculate performance (precision and recall)

def preformance():
    labeled = getLabeledData()
    found = findMinisteries()
    
    tp = 0
    fp = 0
    fn = 0
    fpList = []
    fnList = []
    
    
    for file in found:
        
        for ministerie in found[file]:
            if ministerie not in labeled[file]:
                fp += found[file][ministerie]
                fpList.append(ministerie)
                
            elif found[file][ministerie] == labeled[file][ministerie]:
                tp += found[file][ministerie]
                
            elif found[file][ministerie] > labeled[file][ministerie]:
                tp += labeled[file][ministerie]
                fp += found[file][ministerie] - labeled[file][ministerie]
                fpList.append(ministerie)
                
            elif found[file][ministerie] < labeled[file][ministerie]:
                tp += found[file][ministerie]
                fn += labeled[file][ministerie] - found[file][ministerie]
                fnList.append(ministerie)
    
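        for ministerie in labeled[file]:
            # note: `found` is keyed by file name, so this test is true for every
            # labeled ministerie and each one adds its full labeled count to fn,
            # including mentions already counted as tp above; checking against
            # found[file] would count only the ministeries the matcher missed
            if ministerie not in found:
                fn += labeled[file][ministerie]
                fnList.append(ministerie)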
                
    

    print(tp, fp, fn)
    print('recall', tp / (tp + fn))
    print('precision', tp / (tp + fp))
    print('')
    
preformance()

75 10 124
recall 0.3768844221105528
precision 0.8823529411764706
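The two scores follow directly from the three printed counts. The sketch below first redoes that arithmetic and then shows one possible way to derive the counts per document by comparing how often each name was found with how often it was labeled. The `score` helper and the example dictionaries are hypothetical and are not the notebook's exact code.

```python
# Counts printed above: true positives, false positives, false negatives
tp, fp, fn = 75, 10, 124
recall = tp / (tp + fn)        # 75 / 199
precision = tp / (tp + fp)     # 75 / 85
print(round(recall, 3), round(precision, 3))  # 0.377 0.882


def score(found, labeled):
    """Compare per-document mention counts and return (tp, fp, fn)."""
    n_tp = n_fp = n_fn = 0
    for doc in labeled:
        found_doc = found.get(doc, {})
        for name in set(found_doc) | set(labeled[doc]):
            n_found = found_doc.get(name, 0)
            n_labeled = labeled[doc].get(name, 0)
            n_tp += min(n_found, n_labeled)       # mentions both agree on
            n_fp += max(n_found - n_labeled, 0)   # matched more often than labeled
            n_fn += max(n_labeled - n_found, 0)   # labeled more often than matched
    return n_tp, n_fp, n_fn


# Hypothetical counts for a single document
example_found = {"doc_a": {"financiën": 3, "defensie": 1}}
example_labeled = {"doc_a": {"financiën": 2, "vws": 4}}
print(score(example_found, example_labeled))  # (2, 2, 4)
```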



## Conclusion

It is easy to find correct matches for ministries when using gazetteers. The precision in this test is 0.882, so most of the matches that were found are actually ministries. However, the recall of the matcher is very bad at 0.377: the matcher does not even find half of the ministries that are in the text. The reason for this is clear. It has to do with enumerations of abbreviations.

Within the texts, many enumerations of ministries are used, for example: "de ministeries van VWS, JenV, en BZK". In this case none of these abbreviations will be matched. The matcher looks for "ministerie van {name of ministerie}". VWS is not matched because "ministeries" is plural instead of singular, and JenV and BZK are not matched because they are not preceded by "ministerie van ". A solution could be to stop requiring the preceding "ministerie van ". However, if this is done, all of those abbreviations are matched outside of context. The abbreviation of the Ministry of Defence, for example, is "def", so the matcher will find every occurrence of the three letters "def". With this modification the recall increases only to 0.458, while the precision drops to a meager 0.122. So this is not a solution.
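Both effects can be reproduced on a single sentence. The sketch below uses the enumeration from the example plus a made-up second sentence; the variable names and the short list of abbreviations are chosen for illustration only.

```python
import re

# Example enumeration followed by a hypothetical sentence without any ministry in it
text = "De ministeries van VWS, JenV, en BZK zijn akkoord. Het definitieve besluit volgt.".lower()
abbreviations = ["vws", "jenv", "bzk", "def"]

# Strict pattern: the abbreviation must directly follow the singular 'ministerie van '
strict = {a: len(re.findall("ministerie van " + a, text)) for a in abbreviations}

# Bare pattern: the abbreviation on its own, without any context
bare = {a: len(re.findall(a, text)) for a in abbreviations}

print(strict)  # {'vws': 0, 'jenv': 0, 'bzk': 0, 'def': 0}
print(bare)    # {'vws': 1, 'jenv': 1, 'bzk': 1, 'def': 1}
```

The strict pattern misses all three ministries because of the plural "ministeries", while the bare pattern finds them but also counts "def" inside "definitieve", which is a false positive.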

I also tested whether it would help to at least find the VWS in the example by also looking for the plural "ministeries", but that decreased the precision more than it increased the recall.

The last solution is the "best". It completely ignores the abbreviations and just looks for the names of the ministries without the preceding "ministerie van ". This increases the recall to 0.435, but it also decreases the precision to 0.732, because it also finds unrelated mentions of words such as "financien" and "defensie". I don't think that is worth it. In this case I would rather have more confidence that what I find is correct than find everything.
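A short sketch of why matching on the name alone costs precision. The two sentences are made up: only the first one actually mentions the ministry.

```python
import re

# Hypothetical text: the second sentence is about municipal finances, not the ministry
text = ("het ministerie van financiën heeft gereageerd. "
        "de financiën van de gemeente zijn op orde.")

with_prefix = len(re.findall("ministerie van financiën", text))
name_only = len(re.findall("financiën", text))

print(with_prefix)  # 1
print(name_only)    # 2
```

With one labeled mention, the name-only search would score one true positive and one false positive on this text.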