Geoparsing Tutorial Notebook
Jupyter notebook for geoparsing historical encyclopedia texts in French using the PERDIDO Geoparser.
This notebook was created by L. Moncla (INSA Lyon) and K. McDonough (The Alan Turing Institute) as part of the GEODE project.
In this tutorial, we demonstrate how to use a custom version of the Perdido geoparser Python library developed in the GEODE project. We use texts from Diderot and d’Alembert’s Encyclopédie as a case study for querying a corpus and wrangling geoparsed data. We also compare Perdido’s NER annotations (i.e. its output) with the results of other well-known Python NER libraries (spaCy and Stanza).
In this tutorial, you will learn how to:
- Load data from TEI-XML files into a Python dataframe
- Use a Python dataframe for simple data analysis
- Test the PERDIDO API for preprocessing French texts (part-of-speech tagging)
- Test the PERDIDO API for geoparsing (geotagging + geocoding) Encyclopédie articles
- Display custom geotagging results (PERDIDO TEI-XML) with the displaCy Named Entity Visualizer
- Display geocoding results on a map
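As a rough illustration of the first two steps in the list above, the sketch below reads a small TEI-XML string into a list of records and counts the place names it contains. The sample document, the `div type="article"` layout, the use of the standard TEI `placeName` element, and the helper name `tei_to_records` are all assumptions made for this example; the actual ARTFL files and the PERDIDO TEI-XML output may use different elements and attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; the element layout below is a toy assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="article">
      <head>LYON</head>
      <p>Grande ville de <placeName>France</placeName>, sur le <placeName>Rhône</placeName>.</p>
    </div>
    <div type="article">
      <head>GENEVE</head>
      <p>Ville située sur le <placeName>Rhône</placeName>.</p>
    </div>
  </body></text>
</TEI>"""


def tei_to_records(xml_text):
    """Return one dict per article with its headword, plain text and place names."""
    root = ET.fromstring(xml_text)
    records = []
    for div in root.iterfind(".//tei:div[@type='article']", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        para = div.find("tei:p", TEI_NS)
        # itertext() flattens the paragraph, dropping the inline markup.
        text = "".join(para.itertext()).strip() if para is not None else ""
        places = [el.text for el in div.iterfind(".//tei:placeName", TEI_NS)]
        records.append({
            "head": head.text if head is not None else "",
            "text": text,
            "places": places,
        })
    return records


records = tei_to_records(SAMPLE_TEI)

# Simple analysis: how often is each place name mentioned across articles?
place_counts = Counter(place for rec in records for place in rec["places"])
print(place_counts.most_common())
```

If pandas is installed, `pandas.DataFrame(records)` turns the same list of dicts into a dataframe with one row per article.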
Open the notebook in the cloud
You can open and run this notebook in a remote environment using the launch badges in the project repository.
Set up a Python environment
Clone this GitHub repository
git clone https://github.com/GEODE-project/perdido-geoparsing-notebook.git
Configure the environment with all dependencies
- Create a new environment called tutorial-geoparsing-py39
conda create -n tutorial-geoparsing-py39 python=3.9
- Activate the environment
conda activate tutorial-geoparsing-py39
- Install the fiona package with conda first (this avoids an installation issue)
conda install fiona==1.8.21
- Install the remaining dependencies with pip
pip install -r requirements.txt
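Once the steps above have finished, a quick check from inside the activated environment can confirm that the interpreter version and key packages are in place. This is only a minimal sketch: the helper names `missing_packages` and `python_matches` are made up for this example, and `perdido` is an assumed import name for the geoparser library (only `fiona` is named in the steps above).

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_matches(major, minor):
    """Return True if the running interpreter is exactly major.minor."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    # "fiona" comes from the setup steps; "perdido" is an assumed import name.
    required = ["fiona", "perdido"]

    if not python_matches(3, 9):
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"Expected Python 3.9, found {found}")

    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages were found.")
```

The check only looks packages up without importing them, so it runs quickly and does not fail if a package is installed but broken.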
Launch the Jupyter server
Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).