# <font color='#0f2f5c'>The Linguistic Locator: The marriage between Natural Language Processing and paleography.</font>

### <font color='#0f2f5c'>Table of Contents</font>

* [Research](#Research)
    * [Strengths and Weaknesses](#Research_SaW)
* [Data Gathering](#DataGathering)
* [General Use Methods](#GUM)
* [Data Preprocessing](#DataPrep)
* [Model Testing](#Testing)

### As we go further back in history, it gets harder to pin down when and where a piece of writing comes from. Writing is generally dated by examining the handwriting and the historical context the text fits in. A text can also be dated by examining the material it was written on. These fields of analysis are called paleography and codicology, respectively.
### This work is still done by humans, but it might be possible to predict these attributes computationally. This leads to the main question: which AI models can accurately predict attributes of historical writings?

### To answer this question, the following sub-questions should be answered:
1. [How accurate does the model need to be?](#SubQuestion1)
2. [What models could possibly work for this application?](#SubQuestion2)
3. [What criteria do the writings need to meet?](#SubQuestion3)
4. [What attributes can we predict (accurately)?](#SubQuestion4)
5. [What model works best?](#SubQuestion5)

# <font color='#0f2f5c'>Research</font><a class="anchor" id="Research"></a>
## <font color='#0f2f5c'>How accurate does the model need to be?</font><a class="anchor" id="SubQuestion1"></a>

### Before we can get to designing the project, we first need to do some research and set some targets to test our models with further into the project. First off, we need to define some targets regarding the accuracy of the model in terms of time and location.
### Let's start with the former. From Brent Nongbri's 'The Use and Abuse of P<sup>52</sup>: Papyrological Pitfalls in the Dating of the Fourth Gospel', we can gather that "a period of 50 years is the least acceptable spread of time", and that we "should probably be to avoid dating a hand more precisely than a range of at least seventy or eighty years". As another example, according to Bruce Griffith's 'The Paleographical Dating of P-46': "it is difficult to construct a 95% confidence interval for NT \[New Testament] manuscripts without allowing a century for an assigned date".
### With this knowledge in mind, let's set the accuracy target for time to 40 years for now. So if the model's prediction of the year of writing of a text is off by more than 40 years, we deem it inaccurate. The problem with this method, however, is that it clamps a continuous error to a binary state. If two texts were dated 40 and 41 years off respectively, only the former would count as accurate, whereas a third text dated only 1 year off would count as just as accurate as the first. That doesn't make a lot of sense. We might want to replace this threshold with a formula for judging accuracy in the future.
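
### One possible shape for such a formula, as a minimal sketch: the binary rule described above next to a continuous score that decays linearly with the error. The function names, the linear decay and the 80-year cutoff are assumptions made for illustration, not choices made in this project.

```python
def is_accurate(predicted_year: int, true_year: int, tolerance: int = 40) -> bool:
    """Binary rule from the text: accurate if the prediction is off by at most `tolerance` years."""
    return abs(predicted_year - true_year) <= tolerance


def accuracy_score(predicted_year: int, true_year: int, cutoff: int = 80) -> float:
    """Hypothetical continuous alternative.

    Returns 1.0 for a perfect prediction and falls linearly to 0.0
    once the prediction is `cutoff` years off or more.
    """
    error = abs(predicted_year - true_year)
    return max(0.0, 1.0 - error / cutoff)


# The three cases from the text: 1, 40 and 41 years off.
for error in (1, 40, 41):
    predicted = 1690 + error
    print(error, is_accurate(predicted, 1690), accuracy_score(predicted, 1690))
```

### With a score like this, the texts dated 40 and 41 years off end up with nearly the same rating, while the text dated 1 year off is rated clearly higher than both.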

### But what about location? At the time of writing, I unfortunately haven't been able to find any relevant information on pinning down where a text was written, also known as localizing it.

## <font color='#0f2f5c'>Strengths and Weaknesses</font><a class="anchor" id="Research_SaW"></a>

## <font color='#0f2f5c'>What models could possibly work for this application?</font><a class="anchor" id="SubQuestion2"></a>



# <font color='#0f2f5c'>Data Gathering</font><a class="anchor" id="DataGathering"></a>
## <font color='#0f2f5c'>What criteria do the writings need to meet?</font><a class="anchor" id="SubQuestion3"></a>

### To test this, some data should be gathered to train a potential model on. To do this, some criteria need to be set up and the scope of this project needs to be defined.
### To start off, let's only train on texts from 1200 until 1950. 1200 is an arbitrary pick: going earlier makes it progressively harder to find proper examples that have already been dated. 1900 works as a proper endpoint, because going any more recent slowly moves us from historical to contemporary writings.

### The first version of this project will only train on the following languages:
1. Dutch
2. German
3. English
4. French
5. Italian

### Per language, we should collect at least 3 books. The more, the merrier, of course. Some of these should be close to each other in time or place of writing, so that the model doesn't just learn to detect the writing styles of specific authors.

### Now, as for the texts themselves: what rules should they meet?
1. The time and location of writing should be known or easily extrapolated.
2. The text shouldn't be modernized.
3. Most of the paragraphs of the book should be at least 500 characters.

### Because we are going to split the texts into paragraphs and train on those, the third rule exists to filter out short samples. For reference, the following excerpt from Locke's 'An Essay Concerning Humane Understanding Vol II' is 521 characters long.
    "As the ideas men's words stand for are of different sorts, so the way of
    making known the ideas they stand for, when there is occasion, is also
    different. For though DEFINING be thought the proper way to make known
    the proper signification of words; yet there are some words that will
    not be defined, as there are others whose precise meaning cannot be made
    known but by definition: and perhaps a third, which partake somewhat of
    both the other, as we shall see in the names of simple ideas, modes, and
    substances."
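
### The third rule can be checked before a book is added to the dataset. The sketch below is only an illustration: the function name `long_paragraph_share` and the sample text are hypothetical, and what share counts as "most" is left to the reader.

```python
import re


def long_paragraph_share(text: str, min_length: int = 500, separator: str = r"\n\n") -> float:
    """Return the fraction of non-empty paragraphs that are at least `min_length` characters long."""
    paragraphs = [p.strip() for p in re.split(separator, text) if p.strip()]
    if not paragraphs:
        return 0.0
    return sum(len(p) >= min_length for p in paragraphs) / len(paragraphs)


# Hypothetical sample: two long paragraphs and one short one.
sample = "\n\n".join(["a" * 600, "b" * 520, "c" * 40])
share = long_paragraph_share(sample)
print(f"{share:.0%} of the paragraphs meet the minimum length")
```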

# <font color='#0f2f5c'>General Use Methods</font><a class="anchor" id="GUM"></a>
### Before we can dive into the code, we need to define some general functions and methods.

In [1]:
from functools import wraps


def CheckFile(function):  # Requires the first positional argument to be the filepath (Path).
	"""Decorator that only calls `function` if its first argument is an existing file."""

	@wraps(function)
	def wrapper(*args, **kwargs):
		file = args[0]

		if not file.exists():
			print(f"ERROR: '{file}' not found.")
			return None
		if not file.is_file():  # is_file is a method; without the call it is always truthy.
			print(f"ERROR: '{file}' is not a file.")
			return None

		return function(*args, **kwargs)

	return wrapper

# <font color='#0f2f5c'>Data Preprocessing</font><a class="anchor" id="DataPrep"></a>
### Now that we have our data, let's first split the books into paragraphs for training. Most texts can simply be split on blank lines, that is, two consecutive newlines. Some texts, however, produce better samples when split on a custom separator. A defaultdict fits this case well: it returns the default separator for every text that has no custom entry.

In [2]:
from collections import defaultdict

SEPARATORS = defaultdict(lambda: r"\n\n")
SEPARATORS['An Essay Concerning Humane Understanding Vol II - Locke.txt'] = r"\n\n[0-9]\."  # Some sources have better results with custom separators.
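
### To make the lookup concrete, here is a small self-contained sketch of how such a defaultdict behaves together with `re.split`. The file names and sample strings are hypothetical.

```python
from collections import defaultdict
import re

separators = defaultdict(lambda: r"\n\n")    # Default: split on blank lines.
separators["numbered.txt"] = r"\n\n[0-9]\."  # Hypothetical text with numbered sections.

plain = "First paragraph.\n\nSecond paragraph."
numbered = "\n\n1. First section.\n\nStill the first section.\n\n2. Second section."

# 'plain.txt' has no custom entry, so the default separator is used.
plain_parts = re.split(separators["plain.txt"], plain)

# The custom separator keeps the blank line inside section 1 intact.
numbered_parts = re.split(separators["numbered.txt"], numbered)

print(plain_parts)
print(numbered_parts)
```

### Two details are worth keeping in mind. Because the text starts with a separator, the first element of `numbered_parts` is an empty string, which a later length filter would drop. Also, `[0-9]\.` matches a single digit only, so a section numbered `12.` would not be split off; `[0-9]+\.` would cover that case if the source numbers its sections that way.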

In [4]:
from os import listdir
from pathlib import Path
import re

import pandas as pd

# Only keeping letters, digits, spaces, hyphens, apostrophes and double quotes. Also removing leading spaces at the start of the string. Based on this filter:
# https://stackoverflow.com/questions/24676691/whats-a-good-regex-to-include-accented-characters-in-a-simple-way
CHARACTERFILTER = r"^( +)|[^-'\"a-zA-ZÀ-ÖØ-öø-ÿ0-9 ]"  # Could've perhaps used string.punctuation for this... Oh well.

splitText = dict()  # Title: [paragraph, paragraph, ...]


@CheckFile
def SplitText(file: Path) -> list:
	"""Splits the input text and returns a list of its lowercased paragraphs with most punctuation removed."""

	# UTF-8 saves the day, regarding 'nonconventional' characters from different languages.
	with open(file, 'r', encoding='utf-8') as textFile:
		text = textFile.read().lower()

	# Look up the separator by filename, since the keys in SEPARATORS are bare filenames (textFile.name would be the full path).
	segments = re.split(SEPARATORS[file.name], text)
	segments = [s.replace('\n', ' ') for s in segments]  # Join the lines within a paragraph.
	segments = [re.sub(CHARACTERFILTER, '', s) for s in segments]
	return segments


def LinkData(textData: dict, metaData: pd.DataFrame) -> pd.DataFrame:
	"""Merges the newly split paragraph data and metadata regarding the texts in order to output a dataframe of paragraphs, each labeled with its metadata."""

	linkedData = pd.DataFrame(columns=['Title', 'Text', 'Language', 'Year', 'Longitude', 'Latitude'])
	for title, textList in textData.items():
		_, language, year, longitude, latitude = metaData.loc[metaData['Title'] == title].squeeze().tolist()  # The first column repeats the title, so it is discarded.

		for paragraph in textList:
			linkedData.loc[len(linkedData.index)] = [title, paragraph, language, year, longitude, latitude]
	return linkedData


def GetData(textFiles: list, csv: pd.DataFrame):
	"""Retrieves the text data and its metadata in order to return a merged, cleaned dataset of paragraphs and additional information."""

	global splitText

	for file in textFiles:
		data = SplitText(file)
		data.sort(key=len)

		data = [p for p in data if len(p.strip()) >= 500]  # Keep paragraphs of at least 500 characters. TODO: elaborate on size <--> amount balance.

		splitText[file.name] = data
	data = LinkData(splitText, csv)

	return data

files = [Path(f'D:\\Documents\\Fontys\\S4\\Language\\Data\\Raw\\{file}') for file in listdir('D:\\Documents\\Fontys\\S4\\Language\\Data\\Raw')]

data = GetData(files, pd.read_csv('D:\\Documents\\Fontys\\S4\\Language\\Data\\Done\\Metadata.csv', encoding='unicode_escape'))
print(data.head())

                                               Title  \
0  An Essay Concerning Humane Understanding Vol I...   
1  An Essay Concerning Humane Understanding Vol I...   
2  An Essay Concerning Humane Understanding Vol I...   
3  An Essay Concerning Humane Understanding Vol I...   
4  An Essay Concerning Humane Understanding Vol I...   

                                                Text Language  Year  \
0  to conclude this is that which in short i woul...  English  1690   
1  as the ideas men's words stand for are of diff...  English  1690   
2  in all things therefore where we have clear ev...  English  1690   
3  the names of mixed modes being general they st...  English  1690   
4  first essence may be taken for the very being ...  English  1690   

   Longitude Latitude  
0     51.756    0.212  
1     51.756    0.212  
2     51.756    0.212  
3     51.756    0.212  
4     51.756    0.212  
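
### Appending rows one by one with `.loc` gets slow as the number of paragraphs grows. As a hedged alternative, the same kind of table can be built with `DataFrame.explode` and `DataFrame.merge`. The sketch below uses small hypothetical stand-ins for the paragraph dictionary and the metadata file; it is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the paragraph dictionary and the metadata CSV.
split_text = {
    "Book A.txt": ["first paragraph", "second paragraph"],
    "Book B.txt": ["only paragraph"],
}
metadata = pd.DataFrame({
    "Title": ["Book A.txt", "Book B.txt"],
    "Language": ["English", "Dutch"],
    "Year": [1690, 1750],
    "Longitude": [0.0, 1.0],
    "Latitude": [50.0, 52.0],
})

# One row per title holding a list of paragraphs, then one row per paragraph.
paragraphs = pd.DataFrame({
    "Title": list(split_text.keys()),
    "Text": list(split_text.values()),
}).explode("Text", ignore_index=True)

# Attach the metadata of each book to every one of its paragraphs.
linked = paragraphs.merge(metadata, on="Title", how="left")
print(linked)
```

### One caveat: a title whose paragraph list is empty would still produce a single row with a missing `Text` value after `explode`, so such rows may need to be dropped afterwards.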
