Session 8: Data Structuring and Querying
Date: Thursday, November 17, 2016, 16h00 (UK time)
Session coordinators: Tom Elliott and Sebastian Heath (Institute for the Study of the Ancient World, NYU)
YouTube link: https://youtu.be/39Ao9_mfCxE
Tom's Slides: a huge googly link
Sebastian's Google Doc: https://docs.google.com/document/d/1jYVwDnf-AvUNuOHLNkvvnQJtaYoXqACCSl_QRhdwqls
This session will survey some of the techniques available for structuring data and finding information in data that has been structured by others. Formal data structures in computing will be introduced, and we will relate these to some of the data formats and encodings found in use today in the field of ancient studies, including:
- Comma-Separated Values (CSV)
- JavaScript Object Notation (JSON) and its specializations GeoJSON and JSON-LD
- Extensible Markup Language (XML) and its use in the Text Encoding Initiative (TEI) and EpiDoc encodings.
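To make the differences between these formats concrete, here is a minimal sketch that encodes one record three ways and reads each with Python's standard library. The record, its field names, and its values are invented for illustration and do not come from any real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same made-up record in three encodings (all values are illustrative).
CSV_TEXT = "id,name,latitude,longitude\n1,Example Place,37.5,23.5\n"

JSON_TEXT = '{"id": 1, "name": "Example Place", "coordinates": [23.5, 37.5]}'

XML_TEXT = (
    '<place id="1">'
    "<name>Example Place</name>"
    '<geo lat="37.5" lon="23.5"/>'
    "</place>"
)

# CSV: a table; each row becomes a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
csv_name = rows[0]["name"]

# JSON: nested objects and arrays map onto Python dicts and lists.
record = json.loads(JSON_TEXT)
json_name = record["name"]

# XML: a tree of elements carrying attributes and text content.
root = ET.fromstring(XML_TEXT)
xml_name = root.find("name").text

print(csv_name, json_name, xml_name)
```

Note that CSV carries no type information, so `rows[0]["latitude"]` is the string `"37.5"`, whereas the JSON parser returns the coordinates as numbers.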
We will also address query and data-manipulation languages suited to these formats, including:
- Regular Expressions (regexes): for text of various kinds
- XPath: for XML
- SQL: for tabular data, such as that found in CSV
- Python dictionaries and JavaScript associative arrays
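As a rough illustration of how the four approaches listed above differ, the following sketch applies each one to a tiny invented dataset using only Python's standard library. The sample text, element names, table name, and values are all assumptions made for the example. Note also that `xml.etree.ElementTree` supports only a subset of XPath.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Regular expression: find four-digit years in free text.
text = "Excavated in 1932 and published in 1951."
years = re.findall(r"\b\d{4}\b", text)

# XPath (the subset ElementTree supports): select elements by attribute.
doc = ET.fromstring(
    "<list>"
    '<item type="coin">denarius</item>'
    '<item type="pot">amphora</item>'
    '<item type="coin">sestertius</item>'
    "</list>"
)
coins = [el.text for el in doc.findall(".//item[@type='coin']")]

# SQL: filter and sort tabular data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finds (name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO finds VALUES (?, ?)",
    [("denarius", 1932), ("amphora", 1951), ("sestertius", 1932)],
)
found_1932 = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM finds WHERE year = ? ORDER BY name", (1932,)
    )
]
conn.close()

# Python dictionary: look up a value directly by its key.
materials = {"denarius": "silver", "amphora": "ceramic"}
material = materials["denarius"]

print(years, coins, found_1932, material)
```

Each tool fits the shape of its data: patterns for unstructured text, paths for trees, declarative queries for tables, and key lookups for associative structures.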
Readings:
- ‘Data Structure’, Wikipedia, the Free Encyclopedia, 2016 https://en.wikipedia.org/w/index.php?title=Data_structure&oldid=740892231
- ‘Data Manipulation Language’, Wikipedia, the Free Encyclopedia, 2016 https://en.wikipedia.org/w/index.php?title=Data_manipulation_language&oldid=723127954
- ‘Regular Expression Tutorial - Learn How to Use Regular Expressions’ http://www.regular-expressions.info/tutorial.html
- ‘List of File Formats’, Wikipedia, the Free Encyclopedia, 2016 https://en.wikipedia.org/w/index.php?title=List_of_file_formats&oldid=741522273
- Y. Shafranovich, Common Format and MIME Type for Comma-Separated Values (CSV) Files, Request for Comments 4180 (Internet Engineering Task Force (IETF), October 2005) https://tools.ietf.org/html/rfc4180
- T. Bray, The JavaScript Object Notation (JSON) Data Interchange Format, Request for Comments 7159 (Internet Engineering Task Force (IETF), March 2014) https://tools.ietf.org/html/rfc7159
- H. Butler, M. Daly, A. Doyle, S. Gillies, S. Hagen, and T. Schaub, The GeoJSON Format, Request for Comments 7946 (Internet Engineering Task Force (IETF), August 2016) https://tools.ietf.org/html/rfc7946
- ‘JSON-LD - JSON for Linking Data’ http://json-ld.org/
Edit the SPARQL queries on the Nomisma.org website to query numismatic data. Discuss the historical questions you can ask with these queries. How can you use the results of these queries in further processing or visualizations?
- Example queries: http://nomisma.org/documentation/sparql
- Search form: http://nomisma.org/sparql
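One possible route to further processing, assuming you can obtain the query results in the W3C SPARQL Query Results JSON Format, is to flatten the bindings into CSV. The sketch below works on a hand-made response. Its variable names, URIs, and labels are placeholders, not real Nomisma data, and `bindings_to_rows` is a hypothetical helper written for this example.

```python
import csv
import io
import json

# A hand-made response in the SPARQL Query Results JSON Format.
# The URIs and labels are placeholders, not real Nomisma data.
SAMPLE = """
{
  "head": {"vars": ["coinType", "label"]},
  "results": {"bindings": [
    {"coinType": {"type": "uri", "value": "http://example.org/type/1"},
     "label": {"type": "literal", "value": "Example type one"}},
    {"coinType": {"type": "uri", "value": "http://example.org/type/2"}}
  ]}
}
"""


def bindings_to_rows(response):
    """Flatten SPARQL JSON bindings into one dict per result row."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # A variable may be unbound in a given row (e.g. from OPTIONAL),
        # so fall back to an empty string.
        rows.append({v: binding.get(v, {}).get("value", "") for v in variables})
    return variables, rows


variables, rows = bindings_to_rows(json.loads(SAMPLE))

# Write the rows as CSV into an in-memory buffer.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=variables, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

Once the results are tabular, they can be loaded into a spreadsheet, a database, or a plotting tool for visualization.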
Select and download a copy of an ancient text in English translation (e.g., via Perseus or another online textual source). Use a regular-expression-capable text editor or command-line tool to identify and extract all of the words in the text that begin with a capital letter. Arrange these words into a list, and then manually determine (with reference back to the text) which are proper nouns. Save these proper nouns in CSV format for use in additional processing.
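If you prefer a script to a text editor, the steps of this exercise could be sketched in Python as follows. The sample sentence is invented, the pattern assumes that capitalized words start with an unaccented A to Z, and the `proper_nouns` list stands in for the manual review, which the code cannot do for you.

```python
import csv
import io
import re

# Placeholder text (invented); substitute the translation you downloaded.
text = "When Marcus came to Rome, he met Julia. The city was crowded."

# A word boundary, one ASCII capital, then any further word characters.
CAPITALIZED = re.compile(r"\b[A-Z]\w*")

# Collect the matches, dropping repeats but keeping first-seen order.
candidates = list(dict.fromkeys(CAPITALIZED.findall(text)))

# The manual step: after checking each candidate against the text,
# keep only the proper nouns. This list is filled in by hand.
proper_nouns = ["Marcus", "Rome", "Julia"]

# Write one proper noun per row, under a header, as CSV.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["word"])
writer.writerows([word] for word in proper_nouns)
csv_text = out.getvalue()

print(candidates)
print(csv_text)
```

The candidate list includes sentence-initial words such as "When" and "The", which is why the review against the text is needed. To save to disk rather than to memory, replace the `io.StringIO()` buffer with a file opened via `open(path, "w", newline="", encoding="utf-8")`, where `path` is a filename of your choosing.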