# An Introduction to Analysing In the Spotlight Data Using Python

The purpose of this notebook is to help you become more familiar with the data structures used to store [LibCrowds](https://www.libcrowds.com/)' results data, as well as introduce you to a key Python library that can be used to request, manipulate, analyse and, in later notebooks, visualise that data.

The Python library used here is called [pandas](https://pandas.pydata.org/), which provides access to high-performance data analysis tools via a relatively accessible Python interface. We will use the library to load all of our *In the Spotlight* results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that accepts many different kinds of input. As everything is stored in memory, rather than on disk, the main limitation of this type of data structure is the amount of RAM installed on the computer. However, for any relatively modern computer this is unlikely to be an issue until we reach tens of millions of results.

We begin by importing the pandas library.

In [30]:
import pandas
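To make the idea of an in-memory, spreadsheet-like structure a little more concrete, the sketch below builds a tiny dataframe and asks pandas how much memory it occupies. The records are invented for illustration and are not taken from the LibCrowds data.

```python
import pandas

# Invented values, purely for illustration.
example_df = pandas.DataFrame({
    "title": ["King Lear", "The Rivals"],
    "genre": ["Tragedy", "Comedy"],
})

# shape is (number of rows, number of columns).
print(example_df.shape)

# Approximate number of bytes held in RAM, including the string contents.
memory_in_bytes = int(example_df.memory_usage(deep=True).sum())
print(memory_in_bytes)
```

The exact number of bytes will vary between pandas versions and platforms, so it is best read as a rough guide.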

## The dataset

As mentioned above, our input for this notebook will be all of the results generated so far through the crowdsourcing projects presented on [*In the Spotlight*](https://www.libcrowds.com/collections/playbills). These results are stored as Web Annotations, a W3C standard used to make data more easily reusable online. The abstract from the [Web Annotation Data Model](https://www.w3.org/TR/annotation-model) is presented below:

> Annotations are typically used to convey information about a resource or associations between resources. Simple examples include a comment or tag on a single web page or image, or a blog post about a news article.
>
> The Web Annotation Data Model specification describes a structured model and format to enable annotations to be shared and reused across different hardware and software platforms. Common use cases can be modeled in a manner that is simple and convenient, while at the same time enabling more complex requirements, including linking arbitrary content to a particular data point or to segments of timed multimedia resources.
>
> The specification provides a specific JSON format for ease of creation and consumption of annotations based on the conceptual model that accommodates these use cases, and the vocabulary of terms that represents it.

By using this standardised structure for our final results we aim to make the crowdsourced data generated via the platform more easily reusable online, providing ways for researchers to answer specific questions via programmatic means.

As well as the results data, we use Web Annotations to store additional user-generated data, such as image tags. All of these annotations are available via a public API, which also complies with a [standardised protocol](https://www.w3.org/TR/annotation-protocol/). By using the API we can gain programmatic access to our current results data. Depending on the objective, this method of consuming and analysing the data is likely to present clear advantages when compared with performing similar analyses using offline datasets (i.e. those downloaded to your computer). Importantly, data requested via the API will always be up to date: very soon after a task is completed the contributions are analysed and the result made available via the API. For ongoing projects, offline datasets are by their nature out of date the minute you download them.

## The API

The LibCrowds Web Annotation API is available at the following location:

https://annotations.libcrowds.com/annotations/

Results are returned in containers in which each key has some semantic meaning. To avoid overloading this notebook with new concepts (perhaps it's too late), we won't go into the meaning of these keys. For now, it is enough to understand that there are programmatic ways to navigate a collection of annotations and pull out the data required.
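For readers who would like a glimpse of what such navigation might look like, here is a minimal sketch. It relies on the Web Annotation Protocol's description of a container as an `AnnotationCollection` whose pages (`AnnotationPage`) carry an `items` list and, where more results exist, a `next` link. The example.org addresses, the sample pages and the `fetch_page` helper are all hypothetical; in practice `fetch_page` would make an HTTP request and parse the JSON response, and the exact responses returned by the LibCrowds API may differ.

```python
def iter_annotations(collection, fetch_page):
    """Yield every annotation in a collection by following `next` links."""
    page = collection.get("first")
    while page is not None:
        # A page may be embedded as an object or referenced by its address.
        if isinstance(page, str):
            page = fetch_page(page)
        yield from page.get("items", [])
        page = page.get("next")


# Hypothetical in-memory pages standing in for HTTP responses.
pages = {
    "https://example.org/annotations/?page=0": {
        "type": "AnnotationPage",
        "items": [{"id": "anno-1"}, {"id": "anno-2"}],
        "next": "https://example.org/annotations/?page=1",
    },
    "https://example.org/annotations/?page=1": {
        "type": "AnnotationPage",
        "items": [{"id": "anno-3"}],
    },
}
collection = {
    "type": "AnnotationCollection",
    "total": 3,
    "first": "https://example.org/annotations/?page=0",
}

ids = [anno["id"] for anno in iter_annotations(collection, pages.__getitem__)]
print(ids)
```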

In addition to the standard protocol the API includes an endpoint, declared below, to which we can make a request for an entire collection of annotations.

In [31]:
IRI = 'https://annotations.libcrowds.com/export/playbills-results/'

Note that this dataset could be very large. If you were to enter the address above into a web browser, it is likely that the browser would run out of memory before the page ever finished loading; the endpoint is intended primarily as a way of exporting the data programmatically.

Further, depending on the aspect of the data that we are interested in, requesting the entire dataset may be computationally inefficient. The API includes other endpoints that can be used to search for specific data, which may be explored in other notebooks. However, for the sake of simplicity, in this notebook we will simply load the entire dataset into a dataframe, before filtering out the parts that we don't need.

In [32]:
df = pandas.read_json(IRI)

ValueError: Expected object or value

That's it: we now have a dataframe containing all of our *In the Spotlight* results data.
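Note that `read_json` raises a `ValueError` when the text it receives is not valid JSON, for instance when a response body is empty. One way to make such a failure easier to diagnose is to separate parsing from downloading. The sketch below is an illustration under assumptions rather than the approach used in this notebook: `frame_from_json_text` is a hypothetical helper, the sample string stands in for a response body, and the export is assumed to be a JSON array of annotations.

```python
import json

import pandas


def frame_from_json_text(text):
    """Parse a JSON array of annotations into a dataframe (hypothetical helper)."""
    try:
        annotations = json.loads(text)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        preview = text[:60]
        raise ValueError(f"Not valid JSON; the text starts with {preview!r}") from err
    return pandas.DataFrame(annotations)


# A stand-in for a response body, invented for illustration.
sample_text = '[{"id": "anno-1", "motivation": "describing"}]'
sample_df = frame_from_json_text(sample_text)
print(sample_df)
```

Printing the first few characters of an unexpected response is often enough to tell whether the server returned an error page, an empty body or some other format.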

## The data model

The LibCrowds data model describes precisely how we use Web Annotations to model our results data, so it might also be worth opening up the [documentation](https://docs.libcrowds.com/data/model/) for reference.

To summarise, in this notebook we will be focussing on the transcription annotations, which are generated after at least three people have contributed to a single task. A task may be, for example, to transcribe a previously marked up region on an image. These contributions are normalised and, if three of the normalised contributions match, the final result is saved as an annotation. If we don't find a match, the task is sent back into the queue.
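The matching rule described above can be sketched in a few lines. This is only an illustration: the actual normalisation rules applied by LibCrowds are not described here, so the `normalise` function below (which trims and collapses whitespace) is an assumption, as are the function names.

```python
from collections import Counter


def normalise(text):
    # Assumed normalisation: trim the ends and collapse runs of whitespace.
    return " ".join(text.split())


def find_match(contributions, required=3):
    """Return the normalised value shared by at least `required` contributions, else None."""
    if not contributions:
        return None
    counts = Counter(normalise(c) for c in contributions)
    value, count = counts.most_common(1)[0]
    return value if count >= required else None


matched = find_match(["King Lear", " King  Lear ", "King Lear"])
unmatched = find_match(["King Lear", "King Leer", "King Lear"])
print(matched)    # King Lear
print(unmatched)  # None, so the task would go back into the queue
```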

The annotations are generated with the structure presented below. Don't worry if this doesn't make any sense just yet; the relevant sections will be explored in a bit more detail later. For now, note that the annotations are stored as JSON-LD, a method of encoding Linked Data using JSON. Each key has some semantic meaning, making it easier for other hardware and software platforms to consume the data. For our current purposes, namely analysis of the transcribed data, we only need to focus on a subset of these keys. Below, we will find out how to manipulate this data to locate only the parts of it we require.

```json-ld
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "https://annotations.libcrowds.com/annotations/playbills-results/7640ddcd-6e48-4a9c-a360-3383032593b6",
  "type": "Annotation",
  "motivation": "describing",
  "created": "2018-02-08T22:15:07.152Z",
  "generated": "2018-02-08T22:15:07.152Z",
  "generator": [
    {
      "id": "https://github.com/LibCrowds/libcrowds",
      "type": "Software",
      "name": "LibCrowds",
      "homepage": "https://www.libcrowds.com"
    },
    {
      "id": "https://backend.libcrowds.com/api/result/42",
      "type": "Software"
    }
  ],
  "body": [
    {
      "type": "TextualBody",
      "purpose": "tagging",
      "value": "title"
    },
    {
      "type": "TextualBody",
      "purpose": "describing",
      "value": "King Lear",
      "format": "text/plain"
    }
  ],
  "target": {
    "source": "https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022589096.0x0002b7",
    "selector": {
      "conformsTo": "http://www.w3.org/TR/media-frags/",
      "type": "FragmentSelector",
      "value": "?xywh=7,1191,1962,359"
    }
  }
}
```
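As an example of reading one of these nested keys, the sketch below pulls the region of the image out of the target's selector. The `parse_xywh` helper is hypothetical, and it assumes plain integer pixel values as in the example above; the Media Fragments syntax also permits other forms, such as percentages, which this sketch does not handle.

```python
from urllib.parse import parse_qs


def parse_xywh(selector_value):
    """Return (x, y, width, height) from a value such as '?xywh=7,1191,1962,359'."""
    query = parse_qs(selector_value.lstrip("?#"))
    x, y, width, height = (int(part) for part in query["xywh"][0].split(","))
    return x, y, width, height


# Only the part of the example annotation that is needed here.
annotation = {
    "target": {
        "selector": {
            "type": "FragmentSelector",
            "value": "?xywh=7,1191,1962,359",
        }
    }
}

region = parse_xywh(annotation["target"]["selector"]["value"])
print(region)  # (7, 1191, 1962, 359)
```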

## Processing the dataset

The dataframe we created earlier contains one row per annotation, where an annotation represents a single result, and a column for each top level key found in the annotations. This is perhaps easiest to understand by looking at a few rows of the dataframe and comparing them to the example JSON-LD annotation presented above.

The following syntax can be used to return a slice of the dataframe; in this case we select the first three rows.

In [None]:
df[:3]
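To see the row-and-column mapping without calling the API, the sketch below builds a dataframe from two trimmed-down annotations. The identifiers and example.org addresses are invented for illustration; only the overall shape follows the example annotation shown earlier.

```python
import pandas

# Two invented, heavily trimmed annotations.
annotations = [
    {
        "id": "anno-1",
        "motivation": "describing",
        "body": [
            {"type": "TextualBody", "purpose": "tagging", "value": "title"},
            {"type": "TextualBody", "purpose": "describing", "value": "King Lear"},
        ],
        "target": {"source": "https://example.org/image-1"},
    },
    {
        "id": "anno-2",
        "motivation": "describing",
        "body": [
            {"type": "TextualBody", "purpose": "tagging", "value": "genre"},
            {"type": "TextualBody", "purpose": "describing", "value": "Tragedy"},
        ],
        "target": {"source": "https://example.org/image-2"},
    },
]

mini_df = pandas.DataFrame(annotations)

# One column per top-level key, one row per annotation.
print(list(mini_df.columns))

# Nested values are kept as ordinary Python objects inside the cells.
print(type(mini_df.loc[0, "body"]))
```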

While these columns all serve a purpose and are essential for proper integration with other software platforms, it is likely that for our analyses we will often only be interested in a few core parts of the dataset, namely the transcriptions provided by our volunteers, along with an indication of what they were transcribing. In Web Annotation terms, these would be referred to as the **body** and **target**, respectively. This is summarised in the following extract from the [Web Annotation Data Model](https://www.w3.org/TR/annotation-model/):

> An Annotation is a Web Resource. Typically, an Annotation has a single Body, which is a comment or other descriptive resource, and a single Target that the Body is somehow "about". The Annotation likely also has additional descriptive properties.

Among these additional descriptive properties is the **motivation**, which specifies the reason for the Annotation's creation. LibCrowds' results are generated with one of three possible motivations: tagging, describing and commenting. The code below selects the motivation column and returns all of the unique values it contains as an array.

In [None]:
df['motivation'].unique()

For many of our analyses, we will only be interested in those annotations with the describing motivation, which is the type of annotation that we use to store our transcription data. The decision to use *describing* as the motivation here (rather than, say, *transcribing*) was taken because *describing* has a semantic meaning defined by the standard [Web Annotation Vocabulary](https://www.w3.org/TR/annotation-vocab/#describing):

> **2.3.5 describing:** The motivation for when the user intends to describe the Target, as opposed to (for example) a comment about it. 

In future notebooks we may go back and look at the comments or tags. For now, however, we no longer need these rows in our dataframe. It is possible to select only those annotations where the motivation is equal to describing and replace the current dataframe with the result, using the code below.

In [None]:
df = df[df['motivation'] == 'describing']
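The filter works in two steps: the comparison produces a column of `True` and `False` values, and indexing the dataframe with that column keeps only the rows marked `True`. The sketch below shows this on a toy dataframe with invented values.

```python
import pandas

# Invented rows, purely for illustration.
toy = pandas.DataFrame({
    "id": ["anno-1", "anno-2", "anno-3"],
    "motivation": ["describing", "tagging", "describing"],
})

# Step 1: the comparison yields one boolean per row.
mask = toy["motivation"] == "describing"
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True.
described = toy[mask].copy()
print(described["id"].tolist())  # ['anno-1', 'anno-3']
```

The call to `.copy()` is optional. It is included here because, in some pandas versions, adding new columns to a filtered dataframe can trigger a warning about assigning to a slice.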

To verify that this has worked we can again list the unique motivations in the dataframe.

In [None]:
df['motivation'].unique()

We can also have another look at the first three rows.

In [None]:
df[:3]

As we can see above, we now have a dataframe that contains only our transcription data. However, many of the columns, such as the body, still contain nested JSON values. We will need to process these columns further to extract the data we specifically require. This is where reference to the [LibCrowds Data Model](https://docs.libcrowds.com/data/model/) becomes necessary in order to understand how these nested fields are structured. As the example annotation towards the top of this page shows, our describing annotations contain two bodies. One body contains a tag that indicates the entity being transcribed (e.g. the title) while the other contains the transcription as plain text. In the annotation, and now in our dataframe, these two bodies are stored in a list within the **body** column.

It would be useful to extract the values of these two bodies and add them to two additional columns in the dataframe, called **tag** and **transcription**. To do this we will write two functions that check a list of bodies and return the value from the one with the purpose we require.

Given a list of bodies, the function below will return the value from the body with the *tagging* purpose.

In [None]:
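def get_tag(bodies):
    """Return the value of the body whose purpose is 'tagging', or None."""
    for body in bodies:
        # Use .get() so that a body without a 'purpose' key is skipped.
        if body.get('purpose') == 'tagging':
            return body['value']
    return None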

We can then apply this function to every item in the body column of our dataframe, outputting the results to a new column, called **tag**.

In [None]:
df['tag'] = df['body'].apply(get_tag)

We will write a similar function to extract all of the transcriptions.

In [None]:
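def get_transcription(bodies):
    """Return the value of the body whose purpose is 'describing', or None."""
    for body in bodies:
        # Use .get() so that a body without a 'purpose' key is skipped.
        if body.get('purpose') == 'describing':
            return body['value']
    return None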

Then, apply it to every item in the body column and output the results to a new column, called **transcription**.

In [None]:
df['transcription'] = df['body'].apply(get_transcription)
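Since the two functions differ only in the purpose they look for, one alternative is a single function that takes the purpose as an argument. The sketch below tries this on a toy dataframe; `get_body_value` is a hypothetical name and the bodies are invented for illustration.

```python
from functools import partial

import pandas


def get_body_value(bodies, purpose):
    """Return the value of the first body with the given purpose, or None."""
    for body in bodies:
        if body.get("purpose") == purpose:
            return body.get("value")
    return None


# Invented bodies that follow the shape of the example annotation.
toy = pandas.DataFrame({
    "body": [
        [
            {"type": "TextualBody", "purpose": "tagging", "value": "title"},
            {"type": "TextualBody", "purpose": "describing", "value": "King Lear"},
        ],
        [
            {"type": "TextualBody", "purpose": "tagging", "value": "genre"},
            {"type": "TextualBody", "purpose": "describing", "value": "Tragedy"},
        ],
    ]
})

# partial() fixes the purpose, leaving a function of one argument for apply().
toy["tag"] = toy["body"].apply(partial(get_body_value, purpose="tagging"))
toy["transcription"] = toy["body"].apply(partial(get_body_value, purpose="describing"))
print(toy[["tag", "transcription"]])
```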

There are remaining columns in our dataframe that still contain JSON data, such as the **target**. However, as we go on to replicate this dataframe for use in other notebooks we will see that we have a purpose for keeping the target as it is. So, for now, our dataframe manipulation is complete. The first few rows of our final dataset are presented below.

In [None]:
df[:3]

## A first glance at the data

Before wrapping up, it might be interesting to introduce a few more pandas functions. 

The [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. We can try this out by first selecting only those rows with the *genre* **tag** and adding them to a new dataframe.

In [None]:
genre_df = df[df['tag'] == 'genre']

We can then describe the **transcription** column using the following code.

In [None]:
genre_df['transcription'].describe()
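For a column of text, `describe` reports four values: the number of entries (`count`), the number of distinct entries (`unique`), the most common entry (`top`) and how often it occurs (`freq`). The sketch below shows this on a handful of invented genres.

```python
import pandas

# Invented values, purely for illustration.
genres = pandas.Series(
    ["Comedy", "Tragedy", "Comedy", "Farce", "Comedy"],
    name="transcription",
)

summary = genres.describe()
print(summary)

# Individual statistics can be picked out by label.
print(summary["top"], summary["freq"])
```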

This simple function already presents some interesting results. At the time of writing, we can see that we have over 200 unique genres, with comedies being the most frequently performed. Similar summaries for *title* or *date* could be produced by entering either in place of *genre*, above.

We might already be curious about what some of the more unusual genres are. To find them, we can use the [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function. This function returns an object containing counts of unique values. 

In the line below we also add the argument `ascending=True` to sort the output in ascending order, then display the first ten rows.

In [None]:
counts = genre_df['transcription'].value_counts(ascending=True)
counts[:10]
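The same call can be tried on a small, invented series to see how the ordering works: each distinct value becomes a label, and with `ascending=True` the rarest values come first.

```python
import pandas

# Invented values, purely for illustration.
toy_genres = pandas.Series(
    ["Comedy", "Tragedy", "Comedy", "Farce", "Comedy", "Tragedy"]
)

toy_counts = toy_genres.value_counts(ascending=True)
print(toy_counts)

# Slicing works as it does for a dataframe; here we take the two rarest values.
print(toy_counts[:2])
```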

In a future notebook we will find out how to visualise such outputs using another Python library, [plotly](https://plot.ly/feed/).

## Summary

In this notebook, we found out how to load all of our results from the *In the Spotlight* crowdsourcing projects into a pandas dataframe. We then manipulated this dataframe to make the key data, such as the transcriptions, more easily accessible. We finished with a basic analysis of the genres discovered so far.

However, each row still only contains a single transcription. To really begin to see the benefit of the programmatic analysis methods introduced here we need some way to view all information about each play in a separate row. We can then begin to pull in additional data from other sources, such as the place names stored in the [IIIF](http://iiif.io/) manifests from which the original crowdsourcing tasks were built. With this additional information we can start to answer questions such as:

- Are there any trends to be found for the genres of plays performed at particular periods or across different regions?
- How likely were plays to travel between regions?
- Were some periods more innovative than others, with a greater variety of plays performed?

In the notebook [*Building A Dataframe of Plays*](intro_to_visualising_libcrowds_data_using_python.ipynb) we will see how to build on the work started here to create another dataframe where each row contains the title, genre and date for a single play.

Alternatively, for a basic introduction to producing visualisations using Python you might like to try the notebook [*An Introduction to Visualising LibCrowds Data Using Python*](intro_to_visualising_libcrowds_data_using_python.ipynb).