# GEOPARSING FOR HIKE DESCRIPTIONS

This notebook was created by [L. Moncla](https://ludovicmoncla.github.io/) as part of the [CHOUCAS](http://choucas.ign.fr/) (2017-2021) project.


## Overview

This tutorial covers:

- Using the [PERDIDO API](http://erig.univ-pau.fr/PERDIDO/api.jsp) for geoparsing (geotagging + geocoding) French hike descriptions
- Displaying custom geotagging results (PERDIDO TEI-XML) with the [displaCy Named Entity Visualizer](https://spacy.io/usage/visualizers)
- Displaying geocoding results on a map

## Introduction

Geoparsing (also known as toponym resolution) refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists of identifying spans of text that refer to place names, while geocoding consists of assigning unambiguous geographic coordinates to them.



The geotagging service of the PERDIDO API uses a cascade of finite-state transducers defining specific patterns for NER and identification of geographic information (spatial relations, etc.). 
> Gaio, M. and Moncla, L. (2019). "Geoparsing and geocoding places in a dynamic space context." In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.

The geocoding task uses a simple gazetteer lookup method. 
For geocoding French hike descriptions, we use the bdnyme database provided by IGN.
In this notebook, we'll use the GPS trace associated with each hike description to compute the bounding box of the area where entities should be located. This helps reduce toponym ambiguity during the geocoding step.
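
As a minimal sketch of why restricting the search area helps, the toy lookup below filters gazetteer candidates with a bounding box. The gazetteer entries, their coordinates and the `lookup` function are hypothetical illustrations; they are not part of the PERDIDO API or of the bdnyme database.

```python
from typing import List, Optional, Tuple

# Hypothetical mini-gazetteer: (name, longitude, latitude).
# Coordinates are illustrative only.
GAZETTEER = [
    ("Refuge du Bois", 6.744359, 45.459557),
    ("Refuge du Bois", 2.100000, 48.700000),
    ("Col de la Croix", 6.700000, 45.500000),
]

# (min_lon, min_lat, max_lon, max_lat)
BBox = Tuple[float, float, float, float]


def lookup(name: str, bbox: Optional[BBox] = None) -> List[Tuple[float, float]]:
    """Return all (lon, lat) candidates for a name, optionally restricted to a bounding box."""
    candidates = [(lon, lat) for n, lon, lat in GAZETTEER if n.lower() == name.lower()]
    if bbox is None:
        return candidates
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        (lon, lat)
        for lon, lat in candidates
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    ]


# Without a bounding box the name is ambiguous; with one, a single candidate remains.
print(lookup("Refuge du Bois"))
print(lookup("Refuge du Bois", bbox=(6.6, 45.4, 6.8, 45.6)))
```

In a real setting the bounding box would come from the GPS trace of the hike rather than from hard-coded values.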


### PERDIDO Geoparser API


The [PERDIDO API](http://erig.univ-pau.fr/PERDIDO/) has been developed for extracting and retrieving displacements from unstructured texts.
> Moncla, L., Gaio, M., Nogueras-Iso, J., & Mustière, S. (2016). "Reconstruction of itineraries from annotated text with an informed spanning tree algorithm." International Journal of Geographical Information Science, 30, 1137–1160.

In this tutorial we'll see how to use the PERDIDO API for geoparsing French hike descriptions. 
We will apply geoparsing on some hike descriptions downloaded from the [visorando](https://www.visorando.com/) web sharing platform.

The PERDIDO Geoparsing and Geocoding services (`http://erig.univ-pau.fr/PERDIDO/api/geoparsing/`) take 4 parameters:
1. api_key: API key of the user
2. lang: language of the document (currently only available for French)
3. content: textual content to parse
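4. bbox: bounding box used to filter entity locations (in this notebook: minimum longitude, minimum latitude, maximum longitude and maximum latitude, separated by spaces)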
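
To make the four parameters concrete, here is a small sketch that encodes them into a query string with the standard library only, without sending any request. The helper `build_query_url` and the sample text are hypothetical; the `bbox` value follows the "min_lon min_lat max_lon max_lat" layout produced by this notebook's `get_bbox()` function.

```python
from urllib.parse import urlencode

BASE_URL = "http://erig.univ-pau.fr/PERDIDO/api/geoparsing/"


def build_query_url(api_key, lang, content, bbox):
    """Build the GET URL for the geoparsing service (hypothetical helper, no network access)."""
    params = {"api_key": api_key, "lang": lang, "content": content, "bbox": bbox}
    # urlencode percent-encodes the values and joins them with '&'
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("demo", "French", "Monter au refuge du Bois.", "6.6 45.4 6.8 45.6")
print(url)
```

Passing the same dictionary to `requests.get(..., params=...)`, as done later in this notebook, performs a comparable encoding automatically.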

The PERDIDO Geoparser returns XML-TEI. The `<name>` element refers to named entities (proper nouns) and the type attribute indicates its class (place, person, etc.). The `<rs>` element refers to extended named entities (e.g. refuge du Bois). The `<location>` element indicates that geographic coordinates were found during geocoding.  


```xml
<rs type="place" subtype="ene" start="13" end="27" startT="3" endT="6" id="en.13">
   <term type="place" start="13" end="19" startT="3" endT="4">
      <w lemma="refuge" type="N" xml:id="w4">Refuge</w>
   </term>
   <w lemma="du" type="PREPDET" xml:id="w5">du</w>
   <rs type="unknown" subtype="no" start="23" end="27" startT="5" endT="6" id="en.14">
      <name type="unknown" id="en.1">
         <w lemma="null" type="NPr" xml:id="w6">Bois</w>
      </name>
   </rs>
   <location>
       <geo source="bdnyme">6.744359 45.459557</geo>
   </location>
</rs>
```
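
As an illustration of how such an element can be read, the sketch below parses the snippet above with the standard-library `xml.etree.ElementTree` (the rest of this notebook uses `lxml`, whose API is similar). The function `extract_place` is a hypothetical helper. The two numbers in `<geo>` are returned in the order they appear; they look like longitude then latitude, but that interpretation is an assumption.

```python
import xml.etree.ElementTree as ET

TEI_SNIPPET = """
<rs type="place" subtype="ene" start="13" end="27" startT="3" endT="6" id="en.13">
   <term type="place" start="13" end="19" startT="3" endT="4">
      <w lemma="refuge" type="N" xml:id="w4">Refuge</w>
   </term>
   <w lemma="du" type="PREPDET" xml:id="w5">du</w>
   <rs type="unknown" subtype="no" start="23" end="27" startT="5" endT="6" id="en.14">
      <name type="unknown" id="en.1">
         <w lemma="null" type="NPr" xml:id="w6">Bois</w>
      </name>
   </rs>
   <location>
       <geo source="bdnyme">6.744359 45.459557</geo>
   </location>
</rs>
"""


def extract_place(xml_text):
    """Return (surface form, coordinate pair or None, gazetteer source or None)."""
    root = ET.fromstring(xml_text.strip())
    # concatenate the <w> tokens in document order to rebuild the surface form
    surface = " ".join(w.text for w in root.iter("w"))
    # only a <location> that is a direct child of this element is considered
    geo = root.find("./location/geo")
    if geo is None:
        return surface, None, None
    first, second = (float(value) for value in geo.text.split())
    return surface, (first, second), geo.get("source")


print(extract_place(TEI_SNIPPET))
```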

## Getting started


First, you need to register on the PERDIDO website to get your API key: http://erig.univ-pau.fr/PERDIDO/api.jsp

In [None]:
# install the required libraries if they are missing from your environment (as is the case on Binder)
!pip3 install spacy
!pip3 install lxml
!pip3 install gpxpy
!pip3 install geojson
!pip3 install folium

In [None]:
import requests
import glob

from spacy.tokens import Span
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy import displacy

import lxml.etree as etree

import gpxpy
import gpxpy.gpx

import geojson
import folium
from IPython.display import display

Let's define some useful functions.

In [None]:
''' function Perdido2displaCy()
    transforms the PERDIDO-NER XML output into a spaCy Doc (for display purposes) '''
def Perdido2displaCy(contentXML):
    root = etree.fromstring(bytes(contentXML, 'utf-8'))

    # one spaCy token per <w> element, each followed by a space
    words = [w.text for w in root.findall('.//w')]
    spaces = [True] * len(words)
    doc = Doc(Vocab(), words=words, spaces=spaces)

    ents = []
    # geocoded entities (label LOC): nearest <rs> ancestor of each <location>,
    # skipped when an enclosing <rs> is itself geocoded
    for child in root.findall('.//location'):
        rs = get_parent(child, 'rs')
        if rs is not None and not loc_in_parent(rs):
            if 'startT' in rs.attrib and 'endT' in rs.attrib:
                ents.append(Span(doc, int(rs.get('startT')), int(rs.get('endT')), label='LOC'))
    # place entities without any <location> (label MISC)
    for child in root.findall('.//rs[@type="place"]'):
        if not parent_exists(child, 'rs', 'place') and not loc_in_child(child):
            if 'startT' in child.attrib and 'endT' in child.attrib:
                ents.append(Span(doc, int(child.get('startT')), int(child.get('endT')), label='MISC'))
    doc.ents = ents
    return doc


''' function parent_exists() 
    returns True if one of the ancestors of child_node is named name_node and has a startT attribute;
    when type_val is given, the ancestor must also have a type attribute equal to type_val ''' 
def parent_exists(child_node, name_node, type_val=None):
    # a single definition replaces the two earlier ones (the second silently overrode the first)
    for parent_node in child_node.iterancestors(name_node):
        if 'startT' in parent_node.attrib:
            if type_val is None or parent_node.get('type') == type_val:
                return True
    return False

''' function loc_in_parent() 
    returns a boolean, true if the element location is found in the <rs> ancestor or false ''' 
def loc_in_parent(node):
    try:
        parent_node = next(node.iterancestors())
        if parent_node.tag == "rs":
            if parent_node.find('location') is not None:
                return True
            else:
                return loc_in_parent(parent_node)
        else:
            return False
    except StopIteration:
        return False

    
''' function loc_in_child() 
    returns True if a <location> element is found among the descendants of node, False otherwise ''' 
def loc_in_child(node):
    return node.find('.//location') is not None

''' function get_parent() 
    returns the first ancestor of the element child_node that have the name name_node ''' 
def get_parent(child_node, name_node):
    try:
        parent_node = next(child_node.iterancestors())
        if parent_node.tag == name_node:
            if 'startT' in parent_node.attrib:
                return parent_node
        return get_parent(parent_node, name_node)
    except StopIteration:
        return None
    

''' function display_map() displays the geocoded toponyms on a map using the folium library '''
def display_map(json_data):
    # GeoJSON positions are ordered (longitude, latitude)
    coords = list(geojson.utils.coords(json_data))
    
    if len(coords) > 0:
        print(str(len(coords))+" records found in gazetteer:")

        # centre the map on the mean position; folium expects [latitude, longitude]
        ave_lon = sum(p[0] for p in coords)/len(coords)
        ave_lat = sum(p[1] for p in coords)/len(coords)

        m = folium.Map(location=[ave_lat, ave_lon], zoom_start=12)
        folium.GeoJson(json_data, name='Toponyms', tooltip=folium.features.GeoJsonTooltip(fields=['id', 'name', 'source'], localize=True)).add_to(m)

        display(m)
    else:
        print("Sorry, no records found in gazetteer for geocoding!")
        
        
''' function display_map_gpx() display the map using the folium library '''
def display_map_gpx(json_data, gpx_filename):
    
    # parse the GPX file and close it afterwards
    with open(gpx_filename, 'r') as gpx_file:
        gpx = gpxpy.parse(gpx_file)
    
    points = []
    for track in gpx.tracks:
        for segment in track.segments:        
            for point in segment.points:
                points.append(tuple([point.latitude, point.longitude]))
    #print(points)
    ave_lat = sum(p[0] for p in points)/len(points)
    ave_lon = sum(p[1] for p in points)/len(points)
    
    m = folium.Map(location=[ave_lat, ave_lon], zoom_start=12)
    
    folium.PolyLine(points, color="red", weight=2.5, opacity=1).add_to(m)
    
    coords = list(geojson.utils.coords(json_data))
    
    if len(coords) > 0:
        print(str(len(coords))+" records found in gazetteer:")
        folium.GeoJson(json_data, name='Toponyms', tooltip=folium.features.GeoJsonTooltip(fields=['id', 'name', 'source'], localize=True)).add_to(m)       
    else:
        print("Sorry, no records found in gazetteer for geocoding!")
        
    display(m)
    
            
''' function get_bbox() returns the bounding box of a given GPS trace
    as a string: "min_lon min_lat max_lon max_lat" '''
def get_bbox(gpx_filename):
    
    latitudes = []
    longitudes = []
   
    # parse the GPX file and close it afterwards
    with open(gpx_filename, 'r') as gpx_file:
        gpx = gpxpy.parse(gpx_file)
    
    for track in gpx.tracks:
        for segment in track.segments:
            for point in segment.points:
                latitudes.append(point.latitude)
                longitudes.append(point.longitude)

    return str(min(longitudes))+' '+str(min(latitudes))+' '+str(max(longitudes))+' '+str(max(latitudes))

## Setting parameters

In [None]:
api_key = 'demo' # !! replace with your own API key
lang = 'French'  # currently only available for French
version = 'Standard' # default: Standard 

In [None]:
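# get the list of txt files from the data directory (stored without the .txt extension)
# sorted so that file indices are the same on every run
txtfiles = []
for file in sorted(glob.glob("data/*.txt")):
    txtfiles.append(file[:-4])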
    
print(txtfiles)

In [None]:
# get the textual content from file
with open(txtfiles[0]+".txt", "r") as file:
    content = file.read()
    
print(content)

In [None]:
# get the bounding box from the gps trace
gpx_filename = txtfiles[0]+".gpx"
bbox = get_bbox(gpx_filename)

#print(bbox)

# set the parameters for the PERDIDO API
parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              "bbox": bbox}

## Call the geoparsing REST API

In [None]:
r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
print(r.text) # shows the result of the request
#you can parse this XML to retrieve the information you are interested in


In the next cells, we will use the displacy library from spaCy to display the PERDIDO-NER XML output. For this purpose, we defined the function `Perdido2displaCy()` in order to transform the PERDIDO-NER XML into a [spaCy](https://spacy.io/) compatible format. Geocoded toponyms are marked in orange (with the label: LOC) while toponyms that are not associated with a location are marked in grey (with the label: MISC).
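
The sketch below illustrates, without spaCy, how the `startT` and `endT` token offsets select an entity. It assumes that `startT` is a zero-based token index and that `endT` is exclusive, which is consistent with the sample XML shown earlier (tokens `w4` to `w6` carry `startT="3"` and `endT="6"`) and with the way `Perdido2displaCy()` passes both values to `Span`. The sentence and the helper `span_text` are hypothetical.

```python
def span_text(words, start_t, end_t):
    """Return the text covered by tokens start_t (inclusive) to end_t (exclusive)."""
    return " ".join(words[start_t:end_t])


# Hypothetical tokenised sentence; indices 3, 4 and 5 hold the place name
words = ["Monter", "jusqu'au", "grand", "Refuge", "du", "Bois"]
print(span_text(words, 3, 6))
```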

In [None]:
doc = Perdido2displaCy(r.text)
displacy.render(doc, style="ent", jupyter=True)

## Call the geocoding REST API

In [None]:
r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)

#print("geojson : "+r.text) ## you can save the geojson in a file if needed

data = geojson.loads(r.text)

display_map_gpx(data, gpx_filename)
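
If you prefer a plain listing to a map, the geocoding result can also be read with the standard `json` module. The sketch below assumes the response is a GeoJSON FeatureCollection of Point features whose properties include `id`, `name` and `source` (the fields used for the map tooltip above); the sample document and the helper `list_toponyms` are hypothetical, and the actual response may be structured differently.

```python
import json

# Hypothetical response body, for illustration only
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [6.744359, 45.459557]},
            "properties": {"id": "en.13", "name": "Refuge du Bois", "source": "bdnyme"},
        }
    ],
})


def list_toponyms(geojson_text):
    """Return (name, latitude, longitude, source) for each Point feature."""
    collection = json.loads(geojson_text)
    rows = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # other geometry types are ignored in this sketch
        # GeoJSON positions are ordered (longitude, latitude)
        lon, lat = geometry["coordinates"][:2]
        props = feature.get("properties") or {}
        rows.append((props.get("name"), lat, lon, props.get("source")))
    return rows


for row in list_toponyms(SAMPLE):
    print(row)
```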

## In brief

In [None]:
print(len(txtfiles))

In [None]:
# choose the file you want to process among the 30 hike descriptions
id_file = 11

# get the textual content from file
with open(txtfiles[id_file]+".txt", "r") as file:
    content = file.read()
    
gpx_filename = txtfiles[id_file]+".gpx"

bbox = get_bbox(gpx_filename)

parameters = {'api_key': api_key, 
              'lang': lang, 
              'content': content, 
              "bbox": bbox}

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geoparsing/', params=parameters)
displacy.render(Perdido2displaCy(r.text), style="ent", jupyter=True)

r = requests.get('http://erig.univ-pau.fr/PERDIDO/api/geocoding/', params=parameters)
display_map_gpx(geojson.loads(r.text), gpx_filename)