# Example workflow

The workflow has three distinct steps:
    
1. Extract text from docx, pdf, etc.
2. Anonymize the text.
3. If applicable, keep only the new content from a series of logbooks that repeat earlier material.
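
Step 3 can be pictured as a line-level diff between consecutive logbooks. The sketch below is not the `eac_py` implementation. It assumes a comparison per line, uses the standard-library `difflib`, and introduces a hypothetical helper `new_lines` that keeps only the lines of the current document that were absent from the previous one.

```python
import difflib


def new_lines(previous: str, current: str) -> str:
    """Return the lines of `current` that do not align with `previous`.

    Hypothetical helper for illustration; the real complement step may work
    differently (for example on sentences or characters instead of lines).
    """
    prev_lines = previous.splitlines()
    curr_lines = current.splitlines()
    matcher = difflib.SequenceMatcher(a=prev_lines, b=curr_lines, autojunk=False)

    kept = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark spans that exist only in the current text
        if tag in ("insert", "replace"):
            kept.extend(curr_lines[j1:j2])
    return "\n".join(kept)


week_1 = "Week 1\nMet the client."
week_2 = "Week 1\nMet the client.\nWeek 2\nBuilt a prototype."
print(new_lines(week_1, week_2))
```

With an empty `previous` string, every line counts as new, so the first logbook in a series would pass through unchanged.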

In [1]:
from eac_py import extract, anonymize, complement

import os
from pathlib import Path

## Data

We'll use a subset of the logbook data as an example. I've renamed and organized all the logbooks for group 1, and separated them from the main data.

In [2]:
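# os.listdir returns entries in arbitrary order; sort so the logbooks are processed chronologically
files = sorted(os.listdir("../../data/example/raw"))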
files

['01.docx',
 '02.docx',
 '03.pdf',
 '04.pdf',
 '05.pdf',
 '06.pdf',
 '07.pdf',
 '08.pdf',
 '09.pdf',
 '10.pdf']

As you can see, the logbooks are a mix of Word (.docx) and PDF (.pdf) documents. As far as our pipeline is concerned, this should not matter.

In [3]:
# keep track of document status over time
previous = ""

# we have a list of known actors that might appear in the document
known_actors = ["Hamada Abou Zarad", "Mohammed", "Joyce Rops", "Joyce", "Naud de Adelhart Toorop", "Naud", "Jelle Room", "Jelle", "Marie Appelman", "Marie", "Pepijn Bakker", "Pepijn", "Luuk Baten", "Luuk"]

# define some regex patterns for common private data
student_number = r"[sSmMxX]\d{7}"   # s, m or x (either case) followed by 7 digits
email_address = r"(\w+\.)*\w+\d*@(student\.)?utwente\.nl"  # *@[student.]utwente.nl

# loop over the logbooks
for i, file in enumerate(files, start=1):

    # obtain a standardized path to the file
    path = Path("../../data/example/raw") / file
    print(path)

    # extract text
    text = extract.extract_text(path)

    # get named entities from the cover page
    cover = extract.extract_text(path, 0)
    actors = anonymize.get_person_entities(cover)
    actors.extend(known_actors)

    # anonymize the text
    text = anonymize.anonymize(text, actors, [student_number, email_address])

    # remove repeated materials
    new_content = complement.complement(previous, text)

    # use a separate name for the file handle so it does not shadow the loop variable
    with open(f"../../data/example/clean/log-{i}.txt", "wb") as out_file:
        out_file.write(new_content.encode("utf8"))

    # keep track of the current document to compare the next document to
    previous = text



..\..\data\example\raw\01.docx
..\..\data\example\raw\01.docx -> ..\..\data\example\raw\01.tmp.pdf
100%|██████████| 1/1 [00:01<00:00,  1.44s/it]
..\..\data\example\raw\02.docx
..\..\data\example\raw\02.docx -> ..\..\data\example\raw\02.tmp.pdf
100%|██████████| 1/1 [00:02<00:00,  2.02s/it]
..\..\data\example\raw\03.pdf
..\..\data\example\raw\04.pdf
..\..\data\example\raw\05.pdf
..\..\data\example\raw\06.pdf
..\..\data\example\raw\07.pdf
..\..\data\example\raw\08.pdf
..\..\data\example\raw\09.pdf
..\..\data\example\raw\10.pdf
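
To see what the two regex patterns from the cell above actually catch, they can be tried on a made-up sentence. This is a minimal sketch using only the standard-library `re` module. The sample text, the variable names, and the `[REDACTED]` placeholder are assumptions for illustration, and `anonymize.anonymize` may substitute something different.

```python
import re

# patterns copied from the workflow cell above
student_number_re = r"[sSmMxX]\d{7}"
email_re = r"(\w+\.)*\w+\d*@(student\.)?utwente\.nl"

# fictional sample text, not taken from the logbooks
sample = "Contact j.doe@student.utwente.nl (s1234567) or the tutor at a.b.jansen@utwente.nl."

# replace every match with a placeholder, e-mail addresses first
redacted = sample
for pattern in (email_re, student_number_re):
    redacted = re.sub(pattern, "[REDACTED]", redacted)

print(redacted)
# → Contact [REDACTED] ([REDACTED]) or the tutor at [REDACTED].
```

Note that the student-number pattern has no word boundaries, so it also matches the first eight characters of a longer token such as `s12345678`. Wrapping it in `\b...\b` would be one way to tighten it if that turns out to matter.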
