# Common Terminology in Data Science and Digital Bibliography

Data Science, Computer Science, and Statistics (the primary domains that underlie computational humanities) are broad fields encompassing countless specialties and subdomains.  Digital Humanities, writ large, is one such subdomain, as is Digital Bibliography.  Nobody (not even Elon Musk, despite what he might think) can be expert in all computational methods and approaches.  While some core practices and methodologies are common to nearly all computational work, particular subdomains also tend to develop their own methods, workflows, and vocabularies.  The section below introduces some of the most common computing terms and concepts that you will encounter as a Digital Bibliographer.  These are all covered in greater detail in subsequent modules in the course; however, because they are so common, you are likely to hear each of them mentioned before you actually learn how to do them.  As such, we are providing high-altitude explanations here by way of introduction.

# API: 
An *Application Programming Interface (API)* is the name used to describe an internet site that is designed to be used by computers rather than people.  *APIs* are accessed using a URL just like a website, but instead of returning a human-readable webpage, they return raw data (typically in a structured format such as JSON or XML) meant to be used by a computer.
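
As a rough illustration, the sketch below builds the kind of URL a script might send to an API and then unpacks the sort of raw data that could come back. The endpoint `https://api.example.org/search`, its parameters, and the sample response are all hypothetical, invented only for this example; nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/search"


def build_request_url(base_url, **params):
    """Attach query parameters to a base URL, as an API request would."""
    return f"{base_url}?{urlencode(params)}"


# Made-up example of the raw JSON text an API might send back.
SAMPLE_RESPONSE = (
    '{"count": 2, "results": ['
    '{"title": "A Ballad", "year": 1650}, '
    '{"title": "Another Ballad", "year": 1672}]}'
)


def parse_titles(response_text):
    """Convert the raw JSON text into Python data and pull out the titles."""
    data = json.loads(response_text)
    return [result["title"] for result in data["results"]]


url = build_request_url(BASE_URL, q="ballad", century=17)
titles = parse_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

In a real workflow, the response text would come from requesting the URL rather than from a hard-coded string.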

# Corpus
When we refer to a corpus, we are referring to a specific collection of texts that is being analysed.  For example, the English Broadside Ballad Archive contains a large collection of digitized English broadside ballads, primarily from the 17th century.  If I were interested in performing text analysis on ballads from the 17th century, I might use all of the 17th-century ballads from this collection and combine them with 17th-century ballads from the Bodleian Library's ballad collection.  This set of texts (17th-century ballads from two digital collections) would constitute my *corpus*.
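
A minimal sketch of assembling such a corpus might look like the following. The records are made-up placeholders standing in for the two collections, and the field names (`title`, `year`, `source`) are assumptions chosen for the example, not the structure of either archive.

```python
# Placeholder records standing in for two digital collections.
ebba_ballads = [
    {"title": "Ballad A", "year": 1624, "source": "EBBA"},
    {"title": "Ballad B", "year": 1715, "source": "EBBA"},
]
bodleian_ballads = [
    {"title": "Ballad C", "year": 1688, "source": "Bodleian"},
]


def is_17th_century(record):
    """True when the record's year falls between 1601 and 1700 inclusive."""
    return 1601 <= record["year"] <= 1700


# The corpus is every 17th-century ballad drawn from both collections.
corpus = [
    record
    for record in ebba_ballads + bodleian_ballads
    if is_17th_century(record)
]

for record in corpus:
    print(record["source"], record["title"], record["year"])
```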

# Machine Readable:
When we say that something is *machine-readable*, we mean that the data or text is formatted in a way that makes it easy for a computer to read and ingest, as opposed to *human-readable* text or data, which is formatted to be easily read by a person.
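
To make the contrast concrete, the sketch below stores the same made-up bibliographic fact twice: once as a sentence written for a person, and once as JSON, a common machine-readable format. The record itself is invented for illustration.

```python
import json

# Human-readable: easy for a person, awkward for a computer to pick apart.
human_readable = "A Ballad of the Sea was printed in London in 1650."

# Machine-readable: the same information as labelled fields in JSON.
machine_readable = (
    '{"title": "A Ballad of the Sea", "place": "London", "year": 1650}'
)

# A single call turns the JSON text into a Python dictionary.
record = json.loads(machine_readable)

# Each field can now be used directly, including as a number.
print(record["place"])
print(record["year"] < 1700)
```

Extracting the year from the sentence version would require guessing at its wording, which is exactly the problem machine-readable formats avoid.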

# Munging
This is a real term used by Data Scientists.  In this course, we use it to describe the activity of merging different datasets or fields of data together into a single dataset or field; more broadly, the term also covers the cleaning and reshaping of raw data into a usable form.  If, for example, we collected MARC records or text for analysis from multiple sources, we would munge them together into a single dataset or corpus prior to analysis.  We might similarly munge first and last names from a dataset into a single, comma-separated field before displaying a name on a website.
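
Both kinds of munging described above can be sketched in a few lines. The two source lists and the field names (`first`, `last`, `name`) are hypothetical, chosen only to illustrate the idea.

```python
# Two small datasets from hypothetical sources, with names split in two fields.
source_one = [{"first": "Aphra", "last": "Behn"}]
source_two = [{"first": "John", "last": "Milton"}]


def munge_name(person):
    """Merge the first and last name into one comma-separated field."""
    return f"{person['last']}, {person['first']}"


# Step 1: munge the two datasets into a single dataset.
combined = source_one + source_two

# Step 2: munge the two name fields of each record into a single field.
munged = [{"name": munge_name(person)} for person in combined]

for row in munged:
    print(row["name"])
```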

# Program: 
We often speak of the work of writing instructions that a computer should follow as *programming*.  Technically speaking, this is not incorrect, and the work we do as digital bibliographers is considered programming.  However, when we talk about the output of our programming, the code that we write in Python, we are not actually creating computer programs.  Computer scientists generally reserve the word *program* to describe code that must be compiled before it can be run.  (Compiling is a process that uses special software to convert human-readable instructions into binary, executable code.)  Microsoft Word, Photoshop, your email client, etc., are all compiled computer *programs*.  If you tried to open them with a text editor, you would see only gibberish: binary code that cannot readily be edited or changed.  Code that does not need to be compiled before it can be run is known as a *script*, and all of the Python code you will create in this course will be in the form of *scripts* and not *programs*.  See the entry on *Script* below for more information on scripts.

# Scraping: 
The term *scraping* is generally applied to any process used to acquire text and/or data from the internet.  By its strictest definition, scraping describes only situations where we are "scraping" human-readable text from web pages, as opposed to connecting directly to computing systems that have gateways designed to deliver information directly to other computers.  Over time, however, the term has generalized and is now commonly used to describe any situation where a computer reaches out across the network to gather data and text.
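
In the strict sense, scraping means pulling the readable text out of a web page's HTML. The sketch below does this with Python's built-in `html.parser` module; the page content is a made-up string standing in for a downloaded page, and `TextExtractor` is a name invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring the HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


# Made-up HTML standing in for a page fetched from the web.
SAMPLE_PAGE = (
    "<html><body><h1>A Ballad</h1>"
    "<p>Come all ye <em>merry</em> folk.</p></body></html>"
)

parser = TextExtractor()
parser.feed(SAMPLE_PAGE)
parser.close()

page_text = " ".join(parser.chunks)
print(page_text)
```

Real pages are usually messier than this, which is why dedicated scraping libraries are often used in practice.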

# Script: 
When we code in Python, we are creating computer *scripts*, as opposed to computer *programs* as defined above.  *Scripts* do not need to be compiled (converted to binary code) before they can be run.  Instead, they are run through an interpreter, a piece of software that translates and executes the code at runtime.  *Scripts* offer the distinct advantage that they remain human-readable and editable, so they are easier to work with as part of constantly changing workflows.  They do, however, generally run more slowly than compiled *programs*, which can become an issue when dealing with large datasets.

# Subsetting: 
*Subsetting* is a high-level term used to describe the process of extracting a dataset of interest from a larger dataset.  For example, you might subset all of the cataloguing data in the ESTC to include only records for items printed in the 17th century.  You will spend a great deal of your time as a digital bibliographer subsetting data.
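
As a small sketch of the idea, the code below reads a made-up catalogue in CSV form and keeps only the rows printed in the 17th century. The sample data, the column names, and the `subset_by_year` helper are all assumptions for illustration; they do not reflect the actual structure of the ESTC.

```python
import csv
import io

# Made-up catalogue data; a real dataset would be read from a file.
SAMPLE_CATALOGUE = """id,title,year
1,First Item,1598
2,Second Item,1642
3,Third Item,1700
4,Fourth Item,1701
"""


def subset_by_year(rows, start, end):
    """Keep only the rows whose year falls between start and end inclusive."""
    return [row for row in rows if start <= int(row["year"]) <= end]


# Read every row of the larger dataset into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CATALOGUE)))

# Extract the subset of interest: items printed from 1601 to 1700.
seventeenth_century = subset_by_year(rows, 1601, 1700)

for row in seventeenth_century:
    print(row["id"], row["title"], row["year"])
```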