Named Entity Extraction I
Date: Thursday, March 23, 2017, 17h00-18h15 (CET)
Session coordinators: Matteo Romanello and Francesco Mambrini (Deutsches Archäologisches Institut, Berlin)
YouTube link: https://www.youtube.com/watch?v=yxolXPWxmQ0
Slides: Sunoikisis DC, session 9
This session aims to 1) give participants a good understanding of basic programming concepts and constructs in Python and 2) show how to use them in order to extract named entities (e.g. names of people, names of places) from text, a task also known as Named Entity Recognition (NER).
Our approach will be as task-oriented and hands-on as possible. So that participants can start writing and executing Python code right away, they will be provided with an account to access an iPython notebook server (see instructions below). An iPython notebook is essentially a collection of code snippets together with other resources (e.g. text, images) that help explain what the code does. A notebook server is a web-based, interactive programming environment where such notebooks can be created and executed live.
In the following session we will cover more advanced topics, including NER by means of regular expressions, the evaluation of NER systems, and basic statistics on the extracted named entities.
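To give a rough idea of what these three topics involve, here is a minimal sketch using only the Python standard library. It is not the code used in the session: the pattern, the sample text and the hand-made gold list are all assumptions made for illustration. It treats every capitalized word as a candidate entity, compares the candidates with the gold list, and counts how often each candidate occurs.

```python
import re
from collections import Counter

# Assumed pattern for illustration: any single capitalized word is a candidate.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]+\b")


def extract_candidates(text):
    """Return every capitalized word in the text, in order of appearance."""
    return CAPITALIZED_WORD.findall(text)


def precision_recall(predicted, gold):
    """Compare two collections of entity strings as sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up sample text and hand-made gold annotations (both hypothetical).
text = (
    "All Gaul is divided into three parts. The Belgae inhabit one, "
    "the Aquitani another. The Garonne separates the Gauls from the Aquitani."
)
gold = ["Gaul", "Belgae", "Aquitani", "Garonne", "Gauls"]

candidates = extract_candidates(text)
precision, recall = precision_recall(candidates, gold)
frequencies = Counter(candidates)

print(candidates)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(frequencies.most_common(3))
```

On this sample the pattern finds every gold entity but also picks up sentence-initial words such as "All" and "The", so recall is perfect while precision is not. This is the kind of trade-off that evaluating an NER system makes visible.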
- Introduction: Information Extraction and Named Entity Recognition (NER)
- NER: definitions and tasks (extraction, classification, disambiguation)
- Basic programming concepts in Python (loops, if statements, exception catching, variables, functions, basic data structures, input/output)
- Doing NER from Latin texts with CLTK and NLTK
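As a taste of how the programming constructs in the outline above combine into a very small NER routine, here is a sketch that covers two of the three tasks, extraction and classification, with a dictionary lookup. It deliberately does not use CLTK or NLTK. The gazetteer, its labels and the function name `tag_entities` are hypothetical and chosen only for illustration.

```python
# Hypothetical gazetteer: maps a known name form to an entity type.
GAZETTEER = {
    "Gallia": "PLACE",
    "Garumna": "PLACE",
    "Belgae": "GROUP",
    "Aquitani": "GROUP",
}


def tag_entities(text, gazetteer):
    """Return (token, label) pairs for the tokens found in the gazetteer."""
    entities = []
    for raw_token in text.split():           # loop over whitespace-separated tokens
        token = raw_token.strip(".,;:")      # drop surrounding punctuation
        if not token:                        # if statement: skip empty leftovers
            continue
        try:
            label = gazetteer[token]         # dictionary lookup may raise KeyError
        except KeyError:                     # exception catching: not a known name
            continue
        entities.append((token, label))      # list of tuples as the result
    return entities


text = ("Gallia est omnis divisa in partes tres, "
        "quarum unam incolunt Belgae, aliam Aquitani.")

for token, label in tag_entities(text, GAZETTEER):
    print(token, label)
```

A lookup like this only matches the exact forms listed, so an inflected form such as "Galliam" would be missed. It also does nothing about disambiguation, the third task named in the outline.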
S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O’Reilly, 2009. Available at: http://www.nltk.org/book/
Dan Jurafsky and James H. Martin, Speech and Language Processing, Chapter 21 (Information Extraction), pp. 1-7. 3rd edition draft available at https://web.stanford.edu/~jurafsky/slp3/21.pdf
- C. Neudecker, Named entity recognition for digitised historical newspapers.
- M. Romanello (2016), “Exploring Citation Networks to Study Intertextuality in Classics”, in Digital Humanities Quarterly, 10(2). Available at http://www.digitalhumanities.org/dhq/vol/10/2/000255/000255.html
- Giovanni Moretti, Sara Tonelli, Stefano Menini and Rachele Sprugnoli (2014), "ALCIDE: An online platform for the Analysis of Language and Content In a Digital Environment". In Proceedings of the First Italian Conference on Computational Linguistics (CLIC-2014), Pisa, Italy. Available at http://www.fileli.unipi.it/projects/clic/proceedings/vol1/CLICIT2014152.pdf
Using the iPython notebook server
- go to http://nlp.dainst.org:8888
- enter the password you should have received by email
- select a notebook by ticking the box beside the filename (extension
- press the duplicate button that will appear
- select the notebook copy you just created (selection works as above)
- rename it by adding your initials at the end of the file name (e.g.
- you're done! During the common session you will be running and working on your own copy of the notebooks
The exercise is described in the last section of this notebook.
The student is asked to modify the code examples contained in the notebook to extract the named entities from the English translation of Caesar's De Bello Gallico, book 1, available from Perseus.