
Session 8: Data Structuring and Querying

Gabriel Bodard edited this page Oct 5, 2017 · 19 revisions

Date: Thursday, November 17, 2016, 16h00 (UK time)

Session coordinators: Tom Elliott and Sebastian Heath (Institute for the Study of the Ancient World, NYU)

YouTube link: https://youtu.be/39Ao9_mfCxE

Tom's Slides: a huge googly link

Sebastian's Google Doc: https://docs.google.com/document/d/1jYVwDnf-AvUNuOHLNkvvnQJtaYoXqACCSl_QRhdwqls

Outline

This session will survey some of the techniques available for structuring data and finding information in data that has been structured by others. Formal data structures in computing will be introduced, and we will relate these to some of the data formats and encodings found in use today in the field of ancient studies, including:

  • Comma-Separated Values (CSV)
  • JavaScript Object Notation (JSON) and its specializations GeoJSON and JSON-LD
  • Extensible Markup Language (XML) and its use in the Text Encoding Initiative (TEI) and EpiDoc encodings
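As a concrete comparison, the short sketch below serialises one and the same record in all three formats using only the Python standard library. The coin record is invented for illustration; it is not drawn from any real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A single hypothetical coin record, expressed three ways.
record = {"id": "coin-001", "mint": "Athens", "date": "-450"}

# CSV: one header row plus one data row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "mint", "date"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue().strip())

# JSON: the same record as a key-value object.
print(json.dumps(record))

# XML: the same record as an element with nested children.
root = ET.Element("coin", id=record["id"])
ET.SubElement(root, "mint").text = record["mint"]
ET.SubElement(root, "date").text = record["date"]
print(ET.tostring(root, encoding="unicode"))
```

The same information survives each round trip; what changes is how much structure the format can express (flat rows for CSV, nesting for JSON and XML).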

We will also address query and data-manipulation languages suited to these formats, including:

  • Regular Expressions (regexes): for text of various kinds
  • XPath: for XML
  • SQL: for tabular data, such as that found in CSV
  • Python dictionaries and JavaScript objects (associative arrays): for key–value data such as JSON
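Each of these tools can be tried from the Python standard library alone. The sketch below uses invented sample data; note that ElementTree supports only a subset of full XPath.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Regular expression: find capitalised words in a sentence.
caps = re.findall(r"\b[A-Z][a-z]+", "Caesar crossed the Rubicon")
print(caps)  # ['Caesar', 'Rubicon']

# XPath (the subset ElementTree supports): select attribute values.
doc = ET.fromstring("<coins><coin mint='Athens'/><coin mint='Rome'/></coins>")
mints = [c.get("mint") for c in doc.findall(".//coin")]
print(mints)  # ['Athens', 'Rome']

# SQL over tabular data, using an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE coin (id TEXT, mint TEXT)")
db.executemany("INSERT INTO coin VALUES (?, ?)",
               [("c1", "Athens"), ("c2", "Rome")])
first_mint = db.execute("SELECT mint FROM coin WHERE id = 'c1'").fetchone()[0]
print(first_mint)  # Athens

# Python dictionary: key-based lookup, the analogue of a JavaScript object.
lookup = {"c1": "Athens", "c2": "Rome"}
print(lookup["c2"])  # Rome
```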

Required reading

Further readings

Practical exercises

1. Querying

Edit the SPARQL queries on the Nomisma.org website to query numismatic data. Discuss the historical questions you can ask with these queries. How can you use the results of these queries in further processing or visualizations?
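For the "further processing" step: SPARQL endpoints such as Nomisma.org's typically return results in the standard SPARQL 1.1 JSON results format, which Python can flatten into plain rows. A minimal sketch, assuming that format; the binding values below are fabricated for illustration, not real Nomisma data.

```python
import json

# A fabricated response in the SPARQL 1.1 JSON results format,
# standing in for what an endpoint query would actually return.
results = json.loads("""
{
  "head": {"vars": ["mint", "count"]},
  "results": {"bindings": [
    {"mint":  {"type": "uri",     "value": "http://nomisma.org/id/athens"},
     "count": {"type": "literal", "value": "42"}}
  ]}
}
""")

# Flatten each binding into a plain dict, ready for a CSV writer
# or a plotting library.
rows = [{var: b[var]["value"] for var in b}
        for b in results["results"]["bindings"]]
print(rows)
```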

2. Regular expressions

Select and download a copy of an ancient text in English translation (e.g., via Perseus or another online textual source). Use a regular-expression-capable text editor or command-line tool to identify and extract all of the words in the text that begin with a capital letter. Arrange these words into a list, and then manually determine (with reference back to the text) which are proper nouns. Save these proper nouns in CSV format for use in additional processing.
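The extraction and CSV-output steps of this exercise can be sketched in Python as follows. The sample sentence stands in for your downloaded text, and the hard-coded exclusion of "Then" stands in for your own manual review of which words are proper nouns.

```python
import csv
import re

# Sample passage standing in for a downloaded translation.
text = "Then Odysseus spoke to Athena, and the goddess answered him at Ithaca."

# All words beginning with a capital letter.
capitalised = re.findall(r"\b[A-Z][A-Za-z]+\b", text)

# Deduplicate while keeping first-seen order.
unique = list(dict.fromkeys(capitalised))
print(unique)  # ['Then', 'Odysseus', 'Athena', 'Ithaca']

# After manual review against the text, keep only the proper nouns
# ('Then' is capitalised merely because it opens the sentence).
proper_nouns = [w for w in unique if w != "Then"]

# Save them in CSV format for additional processing.
with open("proper_nouns.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["proper_noun"])
    writer.writerows([n] for n in proper_nouns)
```

The same regular expression (`\b[A-Z][A-Za-z]+\b`) works unchanged in most regex-capable editors and in command-line tools such as `grep -o`.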
