## Idea behind splitting the data

There is one big text file containing 200 documents.

All these documents contain (at the very least):
* title
* publish date
* body
* document delimiter
* some garbage text such as hyperlinks

First split the large file into 200 individual documents based on the delimiter.<br>
Preprocess and remove most of the garbage text.<br>
Save the individual documents.<br>

Aggregate the documents by year:
* find title
* find body
* find publish date

Create a file for each year that occurs in the publish dates.
Add to each file:
* document title
* document body

A document is only added if the year of its publish date matches the filename.


*You can skip the code below (but do run it for initialisation purposes).*

In [17]:
# Some initialisation code.

# Load necessary modules.
import os
import sys
import re

# Set the folder in which individual files will be stored.
DOCUMENTS_FOLDER = "./documents"


class Date:
    def __init__(self, day, month, year):
        self.day = day
        self.month = month
        self.year = year

    def __repr__(self):
        return "{} {} {}".format(self.day, self.month, self.year)

    def isValid(self):
        return self.day != -1 and self.month != -1 and self.year != -1

def createDate(dateString):
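    """A dateString has one of the following formats:
        dd month yyyy
        month dd, yyyy
    Return a Date whose day, month and year are strings,
    or an invalid Date (all fields -1) if the string cannot be split into three parts.
    """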
    try:
        dateString = dateString.replace(",", "")
        dateStringSplit = dateString.split(" ")
        if dateStringSplit[0].isdigit():
            day = dateStringSplit[0]
            month = dateStringSplit[1]
        else:
            day = dateStringSplit[1]
            month = dateStringSplit[0]
        year = dateStringSplit[2]
        return Date(day, month, year)
    except IndexError:
        return Date(-1, -1, -1)
    
    
class Document:
    def __init__(self, title, date, body):
        self.title = title
        self.date = date
        self.body = body

    def __repr__(self):
        MAX_LENGTH = 100
        body = self.body[:MAX_LENGTH]  # Slicing also handles bodies shorter than MAX_LENGTH.
        return "{}, {}: {}".format(self.title, self.date, body)
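
To make the behaviour of `createDate` easier to see, here is a minimal, self-contained sketch of the same parsing logic. The name `parseDate` and the tuple/`None` return values are assumptions made for this illustration; the notebook itself uses `createDate` and the `Date` class.

```python
def parseDate(dateString):
    """Hypothetical condensed variant of createDate.

    Returns (day, month, year) as strings, or None when the input
    does not consist of at least three whitespace-separated parts.
    """
    parts = dateString.replace(",", "").split()
    if len(parts) < 3:
        return None
    first, second, year = parts[:3]
    if first.isdigit():
        # Format: dd month yyyy
        return (first, second, year)
    # Format: month dd, yyyy
    return (second, first, year)


print(parseDate("December 30, 1995"))
print(parseDate("3 mei 2016"))
print(parseDate(""))
```

Both orders end up as (day, month, year), and the values stay strings, just like the fields of `Date`.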

**The interesting code starts below.**

## Split the dataset into individual documents

In [18]:
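# Read the downloaded dataset file and store its full text in the variable 'lines'.
# Note that 'lines' is one long string, not a list of lines.
# The 'utf-8-sig' encoding strips the byte order mark (BOM) at the start of the file.
with open('dataset3.txt', 'r', encoding='utf-8-sig') as f:
    lines = f.read()
    
print("First 500 characters of the data:\n{}".format(lines[:500]))

First 500 characters of the data:
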
                               1 of 200 DOCUMENTS

                                   Het Parool

                               December 30, 1995

Ex-vrouw Carlos wil getuigen in proces

SECTION: Pg. 7

LENGTH: 333 words


FRANKFURT - De ex-vrouw van de beruchte terrorist Ilich Sanchez of 'Carlos de
Jakhals', de Duitse Magdalena Kopp, zou bereid zijn tegen haar voormalige
echtgenoot te getuigen in het proces dat wordt voorbereid in Frankrijk. Kopp,
die ervan wordt verdacht dat zij jarenlang


In [19]:
# newDocumentExpression matches text of the form: "123 of 200 DOCUMENTS".
# \d matches digits [0-9].
# For more regular expressions, see RegularExpressions-Example.ipynb
newDocumentExpression = r"\b\d+\b of \b\d+\b DOCUMENTS"
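
As a quick check of what the expression accepts, the sketch below tries it on a few hand-written strings. The test strings are made up for illustration; only the expression itself comes from the notebook.

```python
import re

newDocumentExpression = r"\b\d+\b of \b\d+\b DOCUMENTS"

# A header line as it appears in the dataset (with leading whitespace).
header = "                               1 of 200 DOCUMENTS"
print(re.search(newDocumentExpression, header).group(0))

# No digits: not a delimiter.
print(re.search(newDocumentExpression, "one of the DOCUMENTS"))

# The expression is case-sensitive, so lowercase 'documents' does not match.
print(re.search(newDocumentExpression, "12 of 200 documents"))
```

Note that a sentence in an article body of the form "3 of 5 DOCUMENTS" would also be treated as a delimiter.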

In [20]:
# Split the document in several documents based on the newDocumentExpression.
# Skip the first element (index 0) as it contains all text BEFORE the first document.
# Example:
#     Text before Article
#     1 of 200 DOCUMENTS
#     Title of Article
#     Body of Article
documents = re.split(newDocumentExpression, lines)[1:]
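
The effect of `re.split` and of dropping index 0 is easiest to see on a tiny example. The sketch below uses a made-up two-document text; `sample`, `parts` and `sampleDocuments` are names introduced only for this illustration.

```python
import re

newDocumentExpression = r"\b\d+\b of \b\d+\b DOCUMENTS"

sample = (
    "Text before the first article\n"
    "1 of 2 DOCUMENTS\n"
    "First title\nFirst body\n"
    "2 of 2 DOCUMENTS\n"
    "Second title\nSecond body\n"
)

parts = re.split(newDocumentExpression, sample)
# One element more than the number of delimiters: parts[0] is the text
# before the first delimiter, which is why the notebook skips it.
print(len(parts))
print(repr(parts[0]))

sampleDocuments = [part.strip() for part in parts[1:]]
print(sampleDocuments)
```

The delimiter text itself is not part of any element, so the "x of y DOCUMENTS" lines disappear from the documents.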

In [21]:
# Check that all documents were found.
# The length of the list of documents should equal the number of documents in the dataset (200).
print("The amount of documents found is: {}".format(len(documents)))

The amount of documents found is: 200


### Preprocess and remove garbage text

In [22]:
# Remove the leading/trailing whitespace from the documents.
documents = [document.strip() for document in documents]

In [23]:
# Remove all hyperlink-like text from the documents.
hyperlinkExpression = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b(([-a-zA-Z0-9@:%_\+.~#?&//=]|\s*-)*)"
documents = [re.sub(hyperlinkExpression, '', document) for document in documents]

**Note on the above code**

Removes hyperlinks of the form:<br>
http://www.standaard.be/cnt/dmf20161129_02597674

but also:<br>
https://www.rijksoverheid.nl/actueel/nieuws/2016/11/03/minister-president<br>
[WHITESPACE]-rutte-verzorgt-de-preek-van-de-leek

but only the first line of the following link (the last three lines remain as garbage text):<br>
http://www.hln.be/hln/nl/4125/Internet/article/detail/2972254/2016/11/10/<br>
[WHITESPACE]President<br>
[WHITESPACE]-Trump-mag-twittervolgers-van-POTUS-houden.dhtml?utm_medium=rss&utm_content=ihln<br>
[WHITESPACE]ophlnbehetallerlaatstenieuwsoverinternetgames

If you can do any better, let me know :)
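
One possible improvement, as a rough sketch only: treat every indented line that directly follows a URL and consists of a single token as a continuation of that URL. This is an assumption about how the dataset wraps long links, not a verified property of it, and `wrappedHyperlinkExpression` is a name invented for this example. The expression is also simpler (and greedier) than `hyperlinkExpression`: it removes everything up to the next whitespace.

```python
import re

# Assumption: a wrapped URL continues on following lines that are indented
# and contain exactly one token (no whitespace inside the token).
wrappedHyperlinkExpression = r"https?://\S+(?:\n[ \t]+\S+[ \t]*(?=\n|$))*"

sample = (
    "See http://www.hln.be/hln/nl/4125/Internet/article/detail/2972254/2016/11/10/\n"
    "    President\n"
    "    -Trump-mag-twittervolgers-van-POTUS-houden.dhtml?utm_medium=rss&utm_content=ihln\n"
    "    ophlnbehetallerlaatstenieuwsoverinternetgames\n"
    "Next line of the article."
)

cleaned = re.sub(wrappedHyperlinkExpression, "", sample)
print(repr(cleaned))
```

The trade-off is that an indented one-word line that happens to follow a URL would be removed as well, so it is worth checking the result on a few documents before relying on it.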

In [29]:
def removeTags(document):
    # Lines that contain any of these metadata tags are dropped.
    forbiddenTags = ["SECTION", "BYLINE", "LOAD-DATE", "LANGUAGE", "PUB-TYPE", "DATELINE"]
    # Split into lines and discard the empty ones.
    documentSplit = [line for line in document.split('\n') if line]
    cleanDocument = [line for line in documentSplit if not any(tag in line for tag in forbiddenTags)]
    return "\n".join(cleanDocument)

documents = [removeTags(document) for document in documents]
print(documents[:5])

["Het Parool\n                               December 30, 1995\nEx-vrouw Carlos wil getuigen in proces\nSECTION: Pg. 7\nLENGTH: 333 words\nFRANKFURT - De ex-vrouw van de beruchte terrorist Ilich Sanchez of 'Carlos de\nJakhals', de Duitse Magdalena Kopp, zou bereid zijn tegen haar voormalige\nechtgenoot te getuigen in het proces dat wordt voorbereid in Frankrijk. Kopp,\ndie ervan wordt verdacht dat zij jarenlang lid was van de terreurgroep\nRevolutionaire Cellen, is volgens het weekblad Der Spiegel onlangs uit Venezuela\nteruggekeerd naar haar geboortestad Ulm.\nCarlos zit al anderhalf jaar in voorarrest in Parijs, nadat hij door de Franse\ngeheime dienst was opgehaald in Soedan, waar hij was gearresteerd nadat alle\nwesterse geheime diensten twintig jaar naar hem op zoek waren geweest. Kopp brak\nvier jaar geleden met hem. Zij hebben een dochtertje, Rosa, dat bij haar moeder\nverblijft.\nMagdalena Kopp is inmiddels in Duitsland verhoord door de politie in verband met\neen bomaanslag te

### Saving the individual documents

In [None]:
# Create the folder where individual documents are stored (if it does not exist yet).
os.makedirs(DOCUMENTS_FOLDER, exist_ok=True)

In [None]:
# Write the documents to individual files as '[number].txt', numbering from 1.
# enumerate gives the correct number even when two documents have identical text.
for index, document in enumerate(documents, start=1):
    with open('{}/{}.txt'.format(DOCUMENTS_FOLDER, index), 'w') as writeFile:
        writeFile.write(document)

## Aggregating the documents by year

In [None]:
# Extract the titles of the documents.
titles = []
for document in documents:
    documentSplit = document.split('\n')
    documentSplit = list(filter(None, documentSplit))

    # Find the line that contains 'LENGTH:', because...
    lengthItem = next((s for s in documentSplit if 'LENGTH:' in s), None)
    lengthIndex = documentSplit.index(lengthItem)
    # ...the title is stored in the line just before the one that says 'LENGTH: xxx words'.
    title = documentSplit[lengthIndex-1]
    titles.append(title)
    
print("The first five titles are:\n{}".format(titles[:5]))

In [None]:
# Extract the bodies of the documents.
bodies = []
for document in documents:
    documentSplit = document.split('\n')
    documentSplit = list(filter(None, documentSplit))

    # Find the line that contains 'LENGTH:', because...
    lengthItem = next((s for s in documentSplit if 'LENGTH:' in s), None)
    lengthIndex = documentSplit.index(lengthItem)
    # ...the body is stored in the lines after the one that says 'LENGTH: xxx words'.
    text = ' '.join(documentSplit[lengthIndex+1:])
    bodies.append(text)
    
print("The first five bodies are:\n{}".format(bodies[:5]))
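
The title and body extraction share the same lookup of the 'LENGTH:' line, so they could be combined. The sketch below is one way to do that; `splitAroundLength` is a hypothetical helper that is not part of the notebook, and returning `(None, None)` for a document without a usable 'LENGTH:' line is a choice made for this example.

```python
def splitAroundLength(document):
    """Return (title, body) for one cleaned document.

    Returns (None, None) when there is no 'LENGTH:' line, or when it is
    the very first line so that no title line precedes it.
    """
    documentLines = [line for line in document.split('\n') if line]
    lengthIndex = next(
        (i for i, line in enumerate(documentLines) if 'LENGTH:' in line), None
    )
    if lengthIndex is None or lengthIndex == 0:
        return None, None
    title = documentLines[lengthIndex - 1]
    body = ' '.join(documentLines[lengthIndex + 1:])
    return title, body


sampleDocument = (
    "Het Parool\n"
    "                               December 30, 1995\n"
    "Ex-vrouw Carlos wil getuigen in proces\n"
    "LENGTH: 333 words\n"
    "FRANKFURT - De ex-vrouw\n"
    "van Carlos"
)
print(splitAroundLength(sampleDocument))
print(splitAroundLength("No marker in this text"))
```

Returning a pair per document keeps titles and bodies aligned, and a missing marker shows up as `None` rather than as an exception halfway through the loop.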

### Finding the distribution dates

In [None]:
# Create lists of all possible months to check for.
MONTHS_DUTCH = ["januari", "februari", "maart", "april", "mei", "juni", "juli", "augustus", "september", "oktober", "november", "december"]
MONTHS_ENGLISH = ["january", "february", "march", "april", "may", "june", "july", "august", "september", "october", "november", "december"]
# Also create the capitalised versions of these months.
MONTHS_DUTCH_CAPITAL = [month.capitalize() for month in MONTHS_DUTCH]
MONTHS_ENGLISH_CAPITAL = [month.capitalize() for month in MONTHS_ENGLISH]

# The possible months are all the months defined above.
# Note that MONTHS_DUTCH and MONTHS_ENGLISH_CAPITAL should suffice,
# but this isn't that much work and catches any human errors.
# Turning the list into a 'set' first removes duplicates:
#     ["april", "april", "mei", "may"] -> ["april", "mei", "may"]
# Sorting the result keeps the order of the months the same on every run.
MONTHS = sorted(set(MONTHS_DUTCH + MONTHS_DUTCH_CAPITAL + MONTHS_ENGLISH + MONTHS_ENGLISH_CAPITAL))

# Add a regular subexpression for each month to the general regular expression.
# The final regular expression will match all kinds of variations using all months
# in different positions.
reMonth = []
for month in MONTHS:
    # Expression for 'dd month yyyy'
    reMonth.append(r"\d+ {} \d\d\d\d".format(month))
    # Expression for 'month dd yyyy'
    reMonth.append(r"{} \d+ \d\d\d\d".format(month))
    # Expression for 'dd month, yyyy'
    reMonth.append(r"\d+ {}, \d\d\d\d".format(month))
    # Expression for 'month dd, yyyy'
    reMonth.append(r"{} \d+, \d\d\d\d".format(month))
dateExpression = "(" + '|'.join(reMonth) + ")"

print("Final date regular expression:\n{}".format(dateExpression))
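
For comparison, the same four date formats can be written as one compact expression. This is a sketch under two assumptions that differ from the cell above: the day is limited to one or two digits, and `re.IGNORECASE` replaces the separate capitalised month lists. The name `compactDateExpression` and the sample text are made up for this example.

```python
import re

MONTHS_DUTCH = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
                "augustus", "september", "oktober", "november", "december"]
MONTHS_ENGLISH = ["january", "february", "march", "april", "may", "june", "july",
                  "august", "september", "october", "november", "december"]

# One non-capturing group that matches any month name.
monthGroup = "(?:" + "|".join(sorted(set(MONTHS_DUTCH + MONTHS_ENGLISH))) + ")"

# Either 'dd month' or 'month dd', an optional comma, then a four-digit year.
compactDateExpression = re.compile(
    r"\b(?:\d{1,2} " + monthGroup + "|" + monthGroup + r" \d{1,2}),? \d{4}\b",
    re.IGNORECASE,
)

sampleText = (
    "Het Parool\n"
    "                               December 30, 1995\n"
    "On 3 mei 2016 a follow-up appeared."
)
foundDates = compactDateExpression.findall(sampleText)
print(foundDates)
```

Because the expression contains only non-capturing groups, `findall` returns the full date strings, in the same form that `createDate` expects.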

In [None]:
distributionDates = []
for document in documents:
    allDates = re.findall(dateExpression, document)
    # Select the first date in the article and assume it is the distribution date.
    # This is usually correct, as the documents start with a title and date.
    try:
        distributionDate = allDates[0]
    except IndexError:
        distributionDate = ""
    distributionDates.append(distributionDate)

print("The first five distribution dates are:\n{}".format(distributionDates[:5]))

### Creating and saving the aggregated documents

In [None]:
# Create a Document object for each document (for easier access to title, body and date).
# zip pairs each title with its date and body by position.
documents = [Document(title, createDate(date), body) for title, date, body in zip(titles, distributionDates, bodies)]
# Remove documents with an invalid date, because these cannot be assigned to a year.
documents = [document for document in documents if document.date.isValid()]

In [None]:
# Create a new folder for the per_year data.
perYearRoot = 'per_year'
if not os.path.exists(perYearRoot):
    os.makedirs(perYearRoot)

# Use the documents' dates to create yearly documents containing the title and
# body of each article.
# If the file of a year already exists, simply append to this document, thereby
# aggregating the documents by year.
for document in documents:
    with open("{}/{}.txt".format(perYearRoot, document.date.year), "a+") as outputFile:
        outputFile.write(document.title + "\n\n")
        outputFile.write(document.body + "\n\n\n")
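
Because the cell above opens the files in append mode, running it twice adds every article twice. A possible alternative, sketched here with made-up records, is to group the text per year in memory first and then overwrite each file once. `groupByYear`, `writePerYear` and `sampleRecords` are hypothetical names for this illustration, and the `(year, title, body)` tuples stand in for the notebook's `Document` objects.

```python
import os
import tempfile
from collections import defaultdict


def groupByYear(records):
    """Group (year, title, body) tuples into one text per year."""
    perYear = defaultdict(list)
    for year, title, body in records:
        perYear[year].append(title + "\n\n" + body + "\n\n\n")
    return {year: "".join(parts) for year, parts in perYear.items()}


def writePerYear(records, root):
    """Write one '<year>.txt' file per year, overwriting existing files."""
    os.makedirs(root, exist_ok=True)
    for year, text in groupByYear(records).items():
        path = os.path.join(root, "{}.txt".format(year))
        with open(path, "w", encoding="utf-8") as outputFile:
            outputFile.write(text)


# Made-up sample records, for illustration only.
sampleRecords = [
    ("1995", "Title A", "Body A"),
    ("2016", "Title B", "Body B"),
    ("1995", "Title C", "Body C"),
]
with tempfile.TemporaryDirectory() as root:
    writePerYear(sampleRecords, root)
    writePerYear(sampleRecords, root)  # A second run does not duplicate content.
    print(sorted(os.listdir(root)))
```

The file layout is the same as in the notebook (title, blank line, body, two blank lines), so later processing steps would not need to change.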

We are done with preprocessing.

The `documents` and `per_year` folders now contain documents that can be processed with, for example, NLTK.

See `NLTK-Example.ipynb`.