# GEOPARSING HISTORICAL DOCUMENTS

This notebook is proposed by [L. Moncla](https://ludovicmoncla.github.io/) and [K. McDonough](https://www.turing.ac.uk/people/researchers/katherine-mcdonough) as part of the [GéoDISCO](https://www.msh-lse.fr/projets/geodisco/) (2019-2020) and GEODE (2020-2024) projects.


[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GEODE-project/perdido-geoparsing-notebook/master?filepath=GeoparsingEncyclopedie.ipynb)

## Overview

In this tutorial, we will:

- Load data from TEI-XML files into a Python dataframe
- Use the dataframe for simple data analysis
- Test the [PERDIDO API](http://erig.univ-pau.fr/PERDIDO/api.jsp) for preprocessing French texts (part-of-speech tagging)
- Test the [PERDIDO API](http://erig.univ-pau.fr/PERDIDO/api.jsp) for geoparsing (geotagging + geocoding) Encyclopédie articles
- Display custom geotagging results (PERDIDO TEI-XML) with the [displaCy Named Entity Visualizer](https://spacy.io/usage/visualizers)
- Display geocoding results on a map

## Introduction

Geoparsing (also known as toponym resolution) refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists of identifying spans of text that refer to place names, while geocoding consists of finding unambiguous geographic coordinates for them.


### The PERDIDO Geoparser API 

The [PERDIDO API](http://erig.univ-pau.fr/PERDIDO/) was developed for extracting and retrieving displacements from unstructured texts. It was initially developed for French, Spanish and Italian hiking descriptions.

More recently, as part of the [GéoDISCO project](https://www.msh-lse.fr/projets/geodisco/) we have developed a custom version for historical documents and more specifically for the Encyclopédie.


In this tutorial we'll see how to use the PERDIDO API for preprocessing and geoparsing French texts. 
We will apply geoparsing to the version of the Encyclopédie corpus released by the [ARTFL project](https://encyclopedie.uchicago.edu/) and show the limits of geotagging and geocoding historical documents.

### Acknowledgement

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu/), University of Chicago.


## Getting started


You need to register on the PERDIDO website to get your API key: http://erig.univ-pau.fr/PERDIDO/api.jsp

In [None]:
# import of python libraries

import requests
import lxml.etree as etree
import xml.dom.minidom as xml
import os
from zipfile import ZipFile

import pandas as pd

from spacy.tokens import Span
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy import displacy

from display_xml import XML  

import geojson
import folium
from IPython.display import display


## 1. Loading the data

Here we assume that we have access to a directory with the corpus of documents. 
In our case, documents are XML-TEI files.

In [None]:
path = './data/' # path of the directory containing the corpus of documents

# select one document for testing
file = 'volume09-3630.tei' # Lyon: https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/9/3630/

# get the XML-TEI content of the document
root = etree.parse(path + file, etree.XMLParser(remove_blank_text=True)).getroot()

# print the XML-TEI content
print(xml.parseString(etree.tostring(root)).toprettyxml(indent='  ')) 

### 1.1 Extracting metadata and content from XML-TEI

In the following cell, we define a function for parsing and extracting metadata and text content from an XML-TEI file.
In this example, we only extract from the metadata the normclass (classification of the article, e.g. 'Géographie'), the head (head word of the article), and the author of the article. Then, we also extract the textual content as raw text.
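To make the expected input concrete, here is a minimal sketch that runs the same kind of extraction on a small inline fragment. The fragment is hypothetical and heavily simplified (real ARTFL files contain more markup), the author value is a placeholder, and the helper name `extract_fields` is not part of the notebook. It uses the standard-library `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI fragment: only the elements read during
# extraction are included; the author value is a placeholder.
SAMPLE_TEI = """<TEI>
  <text>
    <body>
      <div1>
        <index type="head" value="AZIRUTH"/>
        <index type="normclass" value="Géographie"/>
        <index type="author" value="unknown"/>
        <p>AZIRUTH (Géographie.) petite ville d'Egypte, sur la côte
        occidentale de la mer Rouge ; ce n'est presque plus qu'un village.</p>
      </div1>
    </body>
  </text>
</TEI>"""

INDEX_TYPES = ('head', 'normclass', 'author')


def extract_fields(tei_string):
    """Return the index metadata and the plain text of one TEI article."""
    root = ET.fromstring(tei_string)
    fields = {key: '' for key in INDEX_TYPES}
    fields['text'] = ''
    div1 = root.find('./text/body/div1')
    if div1 is None:  # no article body found
        return fields
    paragraphs = []
    for elt in div1:
        if elt.tag == 'p':
            # itertext() also collects text nested in child elements;
            # split()/join() collapses line breaks and repeated spaces
            paragraphs.append(' '.join(''.join(elt.itertext()).split()))
        elif elt.tag == 'index' and elt.get('type') in INDEX_TYPES:
            fields[elt.get('type')] = elt.get('value', '')
    fields['text'] = ' '.join(paragraphs)
    return fields


sample_fields = extract_fields(SAMPLE_TEI)
print(sample_fields['head'], '|', sample_fields['normclass'])
print(sample_fields['text'])
```

Returning a dictionary rather than a list is only a choice made for this sketch; it keeps each value attached to its name.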

In [None]:
def getDataFromEDDATEI(file_path, filename):
    file_id = filename[:-4]
    d = []
    try:
        volume = filename[6:8] 
        number = filename[9:-4] 
        head = ''
        normClass = ''
        author = ''
        txtContent = ''
        root = etree.parse(file_path+filename).getroot()
        div1 = root.find('./text/body/div1')
        # find() returns None when the article has no <div1> element
        if div1 is not None and len(div1):
            for elt in div1:
                if elt.tag == 'p':
                    txtContent += ''.join(elt.itertext())
                    txtContent = txtContent.replace('\n', ' ').strip()
                elif elt.tag == 'index':
                    if elt.get('type') == 'normclass':
                        normClass = elt.get('value')
                    if elt.get('type') == 'head':
                        head = elt.get('value')
                    if elt.get('type') == 'author':
                        author = elt.get('value')
        d = [filename, volume, number, head, normClass, author, txtContent]
    except etree.XMLSyntaxError as e:
        pass
        #print(filename + ': ' + str(e))
    return d

This function returns a list containing the filename, the volume and number, the head word, the class, the author, and the textual content.
Let's see what this function returns for the article about Lyon.

In [None]:
getDataFromEDDATEI(path, file)

To analyse and use these data easily, we will now load this information for all the documents in our directory into a [Python dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):

In [None]:
data = []
for doc in os.listdir(path):
    if doc[-4:] == '.tei':
        data.append(getDataFromEDDATEI(path, doc))
df = pd.DataFrame(data, columns=['filename', 'volume', 'number', 'head', 'normClass', 'author', 'txtContent'])
df = df.dropna()
df = df.sort_values(['volume', 'number']).reset_index(drop = True)

df.head(10) # show the first 10 rows of the dataframe
#df.tail(10) # show the last 10 rows of the dataframe

Now we have access to all the attributes and methods of the dataframe object. For instance, we can easily print the number of rows in our dataframe, which corresponds to the number of articles in our corpus:

In [None]:
n = df.shape[0]
print('There are ' + str(n) + ' articles in the input directory')

### 1.2 First look at the data

Now that the data from the XML-TEI files are loaded into a python dataframe, we can have a look at them.
For instance, we can select articles based on their classification in the Encyclopedie.
If we want all articles in 'geography' we can just do as follows: 

In [None]:
# create the list of classes that refer to 'Géographie'
normclassGEO = ['Géographie', 'Géographie moderne',
                 'Géographie ancienne', 'Géographie moderne | Géographie ancienne',
                 'Géographie ancienne | Géographie moderne', 'Géographie sacrée', 'Géographie sainte',
                 'Géographie | Histoire ancienne', 'Géographie historique', 'Géographie | Histoire',
                 'Histoire | Géographie', 'Géographie | Histoire naturelle', 'Géographie | Mythologie',
                 'Géographie ancienne | Mythologie', 'Histoire moderne | Géographie',
                 'Géographie ancienne | Géographie sainte', 'Géographie ancienne | Géographie sacrée',
                 'Géographie sacrée | Géographie ancienne', 'Géographie du moyen âge', 'Géographie des Arabes',
                 'Géographie | Commerce', 'Histoire | Géographie ancienne',
                 'Géographie | Histoire ancienne | Histoire moderne', 'Géographie ancienne | Littérature | Histoire',
                 'Histoire naturelle | Géographie', 'Géographie | Histoire ancienne | Mythologie',
                 'Géographie moderne | Commerce', 'Géographie ancienne | Géographie antique',
                 'Géographie moderne | Histoire', 'Géographie | Histoire monastique',
                 'Géographie ancienne | Géographie moderne | Mythologie', 'Géographie ancienne | Histoire',
                 'Géographie ancienne | Littérature | Mythologie', 'Géographie ancienne | Médailles'
                 ]

# query the dataframe for all articles matching one of the classes in our list
df_geo = df.loc[df['normClass'].isin(normclassGEO)]
df_geo.head(10)

In [None]:
print('There are ' + str(df_geo.shape[0]) + ' geography articles')

Then, we can also make a query based on the value of the data. For instance, we can query all the articles of a specific author:

In [None]:
val = 'Jaucourt'
n = df_geo.loc[df_geo['author'] == val].shape[0]  # build the mask from df_geo itself
print(str(n) + ' articles were written by ' + val)

We can also easily show the number of articles per author:

In [None]:
df_geo.groupby(['author'])["filename"].count()

It is possible to show the value of one of the columns of our dataframe for a specific row (i.e., article) based on its name. For instance, if we want to know who wrote the article about Lyon or if we want to see its content:

In [None]:
df.loc[df['head'] == 'LYON'].author.item()

In [None]:
df.loc[df['head'] == 'LYON'].txtContent.item()

We can also perform a keyword search over the text content of all articles:

In [None]:
val = 'france'
df_2 = df[df['txtContent'].str.contains(val, case=False)]
print(str(df_2.shape[0]) + ' articles contain the word \''+ val + '\'')

As another example, the following query returns all articles that contain the expression 'ville de':

In [None]:
df[df['txtContent'].str.contains("ville de", case=False)]

The same works with the expressions 'océan pacifique' and 'mer pacifique', which can be used to study the Encyclopédie's coverage of the Pacific area:

In [None]:
df[df['txtContent'].str.contains("océan pacifique|mer pacifique", case=False)]


The same approach works for a more thematic search, for instance on 'esclavage':

In [None]:
df[df['txtContent'].str.contains("esclavage", case=False)]

### 1.3 Preprocessing text content

#### Tokenization and part-of-speech (POS) tagging 

In Natural Language Processing (NLP), the first steps before processing text content usually consist of tokenizing sentences and words and assigning each word its grammatical category (part of speech). This allows the construction of more complex rules or queries than a simple keyword search.
This preprocessing step is language dependent, so we have to choose the right tool for the language of our documents. This is a major difficulty when dealing with historical or ancient texts. For instance, it is difficult to find a POS tagger for older varieties of French, as all well-known taggers are trained on contemporary corpora.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.


The PERDIDO API uses [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) for part-of-speech tagging. 

Let's try the PERDIDO API for the first time. We will first test the POS service, which takes 3 parameters: the API key of the user, the language, and the text content. This service returns the annotated text in TEI-XML format.

In [None]:
api_key = 'demo' # !! replace with your own key
lang = 'French'  # currently only available for French
version = 'Encyclopedie' # default: Standard (the standard version was developed for the analysis of hiking descriptions)
gazetteer = 'wikipedia'  # default: bdnyme_ign (only for France)

content = df.loc[df['head'] == 'GANGEA'].txtContent.item()

In [None]:
# set the parameters for the PERDIDO POS tagging service
parameters = {'api_key': api_key, 'lang': lang, 'content': content}
POS_WebService = 'http://erig.univ-pau.fr/PERDIDO/api/pos/txt_xml/'

# run the PERDIDO POS service
r = requests.get(POS_WebService, params=parameters)


Now that the processing is done, we can print the result:

In [None]:
print(r.text) # shows the result of the request

In [None]:
print(xml.parseString(r.text).toprettyxml()) # shows the result in a more readable way

In [None]:
XML(bytes(r.text, 'utf-8'), style='colorful') # shows the same result with syntax color

In the results, we can see that each word has been tokenised and annotated with the XML element `<w>`, which carries the attributes `lemma` and `type` (POS tag).
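As a minimal sketch of how such output could be consumed, the block below reads `<w>` elements into `(token, lemma, pos)` tuples and then filters the proper nouns. The sample XML is hypothetical: only the `<w lemma="…" type="…">` shape and the `NPr` tag come from this notebook, while the wrapping `<s>` element, the other tag names, and the lemma values are illustrative assumptions. The helper `read_tokens` is not part of the PERDIDO API.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt shaped like the POS output; tag names other than
# NPr and all lemma values are illustrative, not actual API results.
SAMPLE_POS_XML = """<s>
  <w lemma="petit" type="ADJ">petite</w>
  <w lemma="ville" type="N">ville</w>
  <w lemma="de" type="PREP">d'</w>
  <w lemma="null" type="NPr">Egypte</w>
</s>"""


def read_tokens(pos_xml):
    """Return a (token, lemma, pos) tuple for every <w> element."""
    root = ET.fromstring(pos_xml)
    return [(w.text or '', w.get('lemma'), w.get('type'))
            for w in root.iter('w')]


tokens = read_tokens(SAMPLE_POS_XML)
# keep only the tokens tagged as proper nouns
proper_nouns = [tok for tok, _lemma, pos in tokens if pos == 'NPr']
print(tokens)
print(proper_nouns)
```

With the real service, the string passed to `read_tokens` would be `r.text`, assuming the response uses unprefixed `<w>` elements as in the excerpt shown in this notebook.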

## 2. Geoparsing: geotagging + geocoding

Geoparsing is divided into two main tasks: geotagging (NER) and geocoding.

The geotagging service of the PERDIDO API uses a cascade of finite-state transducers defining specific patterns for NER and identification of geographic information (spatial relations, etc.). 
> Mauro Gaio and Ludovic Moncla (2019). “Geoparsing and geocoding places in a dynamic space context.” In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.

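The idea of a cascade, where each stage works on the output of the previous one, can be illustrated with a toy sketch. This is not the PERDIDO grammar: the transducers are replaced by two plain Python functions, and the term lists, tag names (other than `NPr`), and function names are assumptions made for the example. Stage 1 groups proper nouns into name spans; stage 2 reuses those spans to build extended entities such as "ville d'Egypte".

```python
# Illustrative word lists, not taken from PERDIDO
FEATURE_TERMS = {'ville', 'village', 'mer', 'côte', 'rivière'}
LINKERS = {'de', "d'", 'du', 'des'}


def tag_names(tokens):
    """Stage 1: group consecutive proper nouns (NPr) into (start, end) spans."""
    spans, start = [], None
    for i, (_text, pos) in enumerate(tokens):
        if pos == 'NPr' and start is None:
            start = i
        elif pos != 'NPr' and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:  # a name that runs to the last token
        spans.append((start, len(tokens)))
    return spans


def tag_extended(tokens, name_spans):
    """Stage 2: extend a name span leftwards over 'feature term + linker'."""
    extended = []
    for start, end in name_spans:
        if (start >= 2
                and tokens[start - 1][0].lower() in LINKERS
                and tokens[start - 2][0].lower() in FEATURE_TERMS):
            extended.append((start - 2, end))
    return extended


# (token, POS) pairs; the POS values are illustrative
sample_tokens = [('petite', 'ADJ'), ('ville', 'N'), ("d'", 'PREP'),
                 ('Egypte', 'NPr'), (',', 'PUN'), ('sur', 'PREP'),
                 ('la', 'DET'), ('mer', 'N'), ('Rouge', 'NPr')]

names = tag_names(sample_tokens)
extended = tag_extended(sample_tokens, names)
for start, end in extended:
    print(' '.join(text for text, _pos in sample_tokens[start:end]))
```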
For our custom version of the PERDIDO Geoparser, the geocoding task uses a simple gazetteer lookup method. We use the French wikiGazetteer (a gazetteer based on Wikipedia and enriched with Geonames data) generated following this work: https://github.com/alan-turing-institute/lwm_GIR19_resolving_places/tree/master/gazetteer_construction
> Mariona Coll Ardanuy, Katherine McDonough, Amrey Krause, Daniel CS Wilson, Kasra Hosseini, and Daniel van Strien. (2019) “Resolving Places, Past and Present: Toponym Resolution in Historical British Newspapers Using Multiple Resources”. In Proceedings of the 13th Workshop on Geographic Information Retrieval (GIR19).

Geographic text analysis research in the digital humanities has focused on projects analyzing modern English-language corpora. 
In this tutorial we propose to highlight the difficulties of extracting and mapping geographical information from historical French texts.
As we'll see below, in addition to the problem of language when it comes to historical documents, the early-modern period lacks temporally appropriate gazetteers.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.


### 2.1 PERDIDO Geoparser

The PERDIDO Geoparsing service (`http://erig.univ-pau.fr/PERDIDO/api/geoparsing/`) takes 6 parameters:
1. api_key: API key of the user
2. lang: language of the document (currently only available for French)
3. content: textual content to parse
4. mode: indicates whether the query uses an exact match on the name (mode: *s*) or also uses alternate names (mode: *a*). (default: *s*)
5. records_limit: maximum number of records found in the gazetteer for each toponym (default: 1)
6. version: indicates the version of the geoparser (Encyclopedie or Standard). Default: Standard (the standard version was developed for the analysis of hiking descriptions)

The requests in this notebook additionally pass the `gazetteer` value set earlier (`wikipedia`; default: `bdnyme_ign`, which only covers France).

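To illustrate what the `mode` and `records_limit` parameters mean for a gazetteer lookup, here is a minimal sketch on a toy gazetteer. It is not the PERDIDO implementation: the function `lookup`, the entry layout, and the case-insensitive comparison are assumptions, and the second "Lyon" entry is a made-up homonym with placeholder coordinates, added only so that `records_limit` has something to cut.

```python
# Toy gazetteer; the second entry is a hypothetical homonym with
# placeholder coordinates.
TOY_GAZETTEER = [
    {'id': 'toy-1', 'name': 'Lyon', 'alt_names': ['Lugdunum'],
     'lat': 45.76, 'lon': 4.84},
    {'id': 'toy-2', 'name': 'Lyon', 'alt_names': [],
     'lat': 0.0, 'lon': 0.0},
]


def lookup(toponym, gazetteer, mode='s', records_limit=1):
    """Return at most records_limit entries whose name matches the toponym.

    mode 's' compares against the main name only;
    mode 'a' also accepts alternate names.
    """
    if mode not in ('s', 'a'):
        raise ValueError("mode must be 's' or 'a'")
    wanted = toponym.casefold()  # case-insensitive match (an assumption)
    matches = []
    for entry in gazetteer:
        names = [entry['name']]
        if mode == 'a':
            names = names + entry['alt_names']
        if any(wanted == name.casefold() for name in names):
            matches.append(entry)
    return matches[:records_limit]


print(lookup('Lugdunum', TOY_GAZETTEER, mode='s'))  # exact name only
print(lookup('Lugdunum', TOY_GAZETTEER, mode='a'))  # alternate names too
print(len(lookup('Lyon', TOY_GAZETTEER, records_limit=5)))
```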
The PERDIDO Geoparser returns XML-TEI. The `<name>` element refers to named entities (proper nouns) and the type attribute indicates its class (place, person, etc.). The `<rs>` element refers to extended named entities (e.g. ville d'Egypte). The `<location>` element indicates that geographic coordinates were found during geocoding.  



As we'll see in the next cell, when we apply the PERDIDO Geoparser to the following example (Volume 1, article 5236, available online from the [ARTFL project](https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/1/5236/)):

>AZIRUTH (Géographie.) petite ville d'Egypte, sur la côte occidentale de la mer Rouge ; ce n'est presque plus qu'un village.


three spatial entities are found during geotagging:
1. Aziruth
2. petite ville d'Egypte
3. la côte occidentale de la mer Rouge

while only one entity (*Egypte*) is found during geocoding:

```xml
<name type="place" subtype="edda" id="en.2">
   <w lemma="null" type="NPr" xml:id="w9">Egypte</w>
   <location>
      <geo source="wiki">35.4833 24.1333</geo>
   </location>
</name>
```

In [None]:
# set the parameters for the PERDIDO geoparsing service
content = df.loc[df['head'] == 'AZIRUTH'].txtContent.item()
parameters = {'api_key': api_key, 'lang': lang, 'content': content, 'mode': "s", "records_limit": 1, "version": version, "gazetteer": gazetteer}

# request the PERDIDO API
r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)

display(XML(bytes(r.text, 'utf-8'), style='colorful')) # shows the PERDIDO-GEOPARSER XML output

In the next cells, we will use the displacy library from spaCy to display the PERDIDO-NER XML output. For this purpose, we define the function `Perdido2displaCy()` in order to transform the PERDIDO-NER XML into a [spaCy](https://spacy.io/) compatible format.

In [None]:
def Perdido2displaCy(contentXML):
    """Transform the PERDIDO-NER XML output into a spaCy Doc (for display purposes)."""
    # map PERDIDO entity types to the labels passed to displaCy
    labels = {'place': 'LOC', 'person': 'PERSON'}
    vocab = Vocab()
    words = []
    spaces = []
    root = etree.fromstring(bytes(contentXML, 'utf-8'))
    # each <w> element is one token
    for w in root.findall('.//w'):
        words.append(w.text)
        spaces.append(True)
    doc = Doc(vocab, words=words, spaces=spaces)
    ents = []
    for child in root.findall('.//rs'):
        # skip <rs> elements nested inside another <rs> that has token offsets
        if not parent_exists(child, 'rs'):
            if 'startT' in child.attrib and 'endT' in child.attrib:
                start = int(child.get('startT'))
                stop = int(child.get('endT'))
                # unknown or missing types fall back to MISC
                label = labels.get(child.get('type'), 'MISC')
                ents.append(Span(doc, start, stop, label=label))
    doc.ents = ents
    return doc

def parent_exists(child_node, name_node):
    """Return True if an ancestor of child_node has the tag name_node and a startT attribute."""
    try:
        parent_node = next(child_node.iterancestors())
        if parent_node.tag == name_node:
            if 'startT' in parent_node.attrib:
                return True
        return parent_exists(parent_node, name_node)
    except StopIteration:
        # the root has been reached without finding a matching ancestor
        return False

In [None]:
doc = Perdido2displaCy(r.text) # transforms the PERDIDO-GEOPARSER XML

displacy.render(doc, style="ent", jupyter=True) # shows the PERDIDO-GEOPARSER XML output using the displacy library

### 2.2 Mapping place names from one article

If you're interested in geocoding, you probably want to display the result on a map. 
There are two solutions:
1. You can parse the PERDIDO-GEOPARSER XML and extract each `<location>` element to get the lat/long coordinates of each entity.
2. You can use the PERDIDO-GEOCODING service `http://erig.univ-pau.fr/PERDIDO/api/geocoding/`. It takes the same parameters as the geoparsing service and returns the result as *geojson*.

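Solution n°1 could look like the following minimal sketch. It assumes that the geoparser output uses unprefixed `<name>`, `<location>` and `<geo>` elements, as in the excerpt shown in section 2.1; the wrapping `<rs>` element in the sample and the helper name `extract_locations` are assumptions. The two numbers are kept in the order given in the XML, since this notebook does not state which one is the latitude.

```python
import xml.etree.ElementTree as ET

# Sample built from the <name> excerpt in section 2.1; the outer <rs>
# wrapper is an assumption added to give the fragment a root element.
SAMPLE_XML = """<rs type="place">
  <name type="place" subtype="edda" id="en.2">
    <w lemma="null" type="NPr" xml:id="w9">Egypte</w>
    <location>
      <geo source="wiki">35.4833 24.1333</geo>
    </location>
  </name>
</rs>"""


def extract_locations(xml_string):
    """Return one record per <name> element that has a <location>/<geo> child."""
    root = ET.fromstring(xml_string)
    records = []
    for name in root.iter('name'):
        geo = name.find('./location/geo')
        if geo is None or not geo.text:
            continue  # entity was tagged but not geocoded
        label = ' '.join(w.text for w in name.iter('w') if w.text)
        coords = tuple(float(value) for value in geo.text.split())
        records.append({'name': label,
                        'coords': coords,
                        'source': geo.get('source')})
    return records


locations = extract_locations(SAMPLE_XML)
print(locations)
```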
Let's try solution n°2:

In [None]:
def get_bounding_box(points):
    """Return the bottom-left and top-right corners of the box enclosing the points."""
    # GeoJSON points are (longitude, latitude) while folium expects
    # (latitude, longitude), so index 1 comes first in each corner
    bot_left_x = min(point[1] for point in points)
    bot_left_y = min(point[0] for point in points)
    top_right_x = max(point[1] for point in points)
    top_right_y = max(point[0] for point in points)
    return [(bot_left_x, bot_left_y), (top_right_x, top_right_y)]

def display_map(json_data):
    """Display the geocoding results on a map using the folium library."""
    coords = list(geojson.utils.coords(json_data))
    if len(coords) > 0:
        print(str(len(coords))+" records found in gazetteer:")

        m = folium.Map()
        m.fit_bounds(get_bounding_box(coords), max_zoom=5)
        # use the function argument rather than the global variable `data`
        folium.GeoJson(json_data, name='Toponyms', tooltip=folium.features.GeoJsonTooltip(fields=['id', 'name', 'source'], localize=True)).add_to(m)

        display(m)
    else:
        print("Sorry, no records found in gazetteer for geocoding!")

In [None]:
r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)


display_map(data)

Now, let's change the parameters. For geocoding, we will now use alternate names and limit the maximum number of records to 5.

In [None]:

parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 'a', 
              'records_limit': 5, 
              'version': version, 
              'gazetteer': gazetteer}

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)


Let's try again with the maximum number of records set to 1:

In [None]:
parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 'a', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)


Sometimes place names are found in the text but geocoding returns no result. This means that none of the entities has a record in the gazetteer. This is often the case for historical documents and for periods for which no appropriate gazetteer exists.

In [None]:
content = df.loc[df['head'] == 'AZMER'].txtContent.item()


parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 'a', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)


In [None]:
content = df_geo.loc[df_geo['head'] == 'DAUPHINE'].txtContent.item()

parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 's', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)

### 2.3 Mapping several articles

In [None]:
df_pacifique = df_geo[df_geo['txtContent'].str.contains("océan pacifique|mer pacifique", case=False)]
df_pacifique

In [None]:
content = ', '.join(df_pacifique['head'].tolist())
content

In [None]:
parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 's', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)

We now change the `mode` parameter to also search alternate names:

In [None]:
parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 'a', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)

Next, we geocode the headwords of articles containing the word 'esclavage':

In [None]:
df_esclavage = df_geo[df_geo['txtContent'].str.contains("esclavage", case=False)].reset_index(drop = True)
content = ', '.join(df_esclavage['head'].tolist())

parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              'mode': 'a', 
              'records_limit': 1, 
              'version': version,
              'gazetteer': gazetteer
             }

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
doc = Perdido2displaCy(r.text)

displacy.render(doc, style="ent", jupyter=True)


r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
data = geojson.loads(r.text)

display_map(data)

## 3. Toponym disambiguation using network analysis

In our work, we construct a network based on the citations between "géographie" articles.
We proposed to use network analysis measures to establish an approximate location, defined by qualitative relations, for each named toponym in EDDA (the Encyclopédie). Throwing a list of decontextualized toponyms at an external resource like Geonames is risky. We therefore hypothesize that defining meaningful links between places can provide essential information to improve disambiguation (and potentially replace resolution as the end goal). We establish connections between places based on the citation of “headword” toponyms (those that appear as headwords of entries) in other EDDA entries.

>Moncla, L., McDonough, K., Vigier, D., Joliveau, T., & Brenon, A. (2019). Toponym disambiguation in historical documents using network analysis of qualitative relationships. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, 1–4. Chicago, IL, USA.

This method draws on relations in the corpus of EDDA articles, which improves disambiguation at a later stage with an external resource. We suggest the network as an alternative to geospatial representation, a useful proxy when no historical gazetteer exists for the source material's period. Our first experiments have shown that this approach goes beyond a simple text analysis and is able to find relations between toponyms that are not co-occurring in the same documents. Network relations are also usefully compared with disambiguated toponyms to evaluate geographical coverage, and the ways that geographical discourse is expressed, in historical texts.
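The construction of such a citation network can be sketched on a toy corpus. This is an illustration under stated assumptions, not the method used in the cited paper: the three entries are sentences written for this example (not quotations from the Encyclopédie), a citation is detected by a naive whole-word, case-insensitive match of a headword, and the function names are hypothetical. In-degree here counts how many other entries mention a headword.

```python
import re
from collections import Counter

# Toy corpus: headword -> text written for this example
TOY_ENTRIES = {
    'LYON': 'Ville de France, au confluent du Rhône et de la Saône.',
    'RHÔNE': 'Fleuve qui passe à Lyon et se jette dans la Méditerranée.',
    'SAÔNE': 'Rivière qui se joint au Rhône à Lyon.',
}


def citation_edges(entries):
    """Return sorted (source, target) pairs where source's text mentions target."""
    edges = set()
    for source, text in entries.items():
        # upper-case every word so it can be compared with the headwords
        words = {word.upper() for word in re.findall(r'\w+', text)}
        for target in entries:
            if target != source and target in words:
                edges.add((source, target))
    return sorted(edges)


def in_degree(edges):
    """Count, for each headword, how many entries cite it."""
    return Counter(target for _source, target in edges)


edges = citation_edges(TOY_ENTRIES)
degrees = in_degree(edges)
for headword, count in sorted(degrees.items()):
    print(headword, count)
```

A real corpus would need more than whole-word matching (multi-word headwords, spelling variants, homonyms), which is where the difficulties discussed in this section come from.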


<table>
  <tr>
    <td> <img src="img/labels_indegree2.png" width ="500px"> </td>
    <td> <img src="img/nodes_betweenness+class2.png" width ="500px" > </td>
  </tr>
  <tr>
    <td>Node and label size indicate in-degree centrality</td>
    <td>Node size indicates betweenness centrality<br/> 
        colors refer to geographic feature types <br/> 
        (city: red, hydronym: blue, country: green, mountain: brown, unclassified: grey)</td>
  </tr>
</table> 



We also ran some preliminary tests by assigning geographic coordinates found in our French wikiGazetteer to each node (headword). Only 2535 of the 13734 nodes have coordinates. 

Our first experiment is shown below. Colors identify clusters of nodes computed with the [modularity measure](https://en.wikipedia.org/wiki/Modularity_(networks)) implemented in Gephi.

<table><tr>
<td> <img src="img/geocodingEDDA1.png" width ="500"> </td>
<td> <img src="img/geocodingEDDA_network.png" width ="500" > </td>
</tr></table> 