# Preparing In the Spotlight results for MARC ingest

A key aim of *In the Spotlight* is to integrate the crowdsourced transcriptions back into the British Library's core discovery systems. One way we do this is by generating a MARC record for each playbill, containing key information such as dates and titles.

This notebook demonstrates how we can use some of the Python libraries [already introduced](intro_to_analysing_its_data_using_python.ipynb) to collate the data for each playbill into a CSV format that can be used as a template for the generation of MARC records.

We begin by importing the required libraries.

In [260]:
import pandas
import datetime
import dateutil.parser  # the parser submodule must be imported explicitly

## The datasets

In the notebook [An Introduction to Analysing In the Spotlight Data Using Python](intro_to_analysing_its_data_using_python.ipynb) we imported all of the results data from [*In the Spotlight*](https://www.libcrowds.com/collection/playbills) into a pandas dataframe. Towards the end of the notebook we stored that dataframe to disk; here, we will load it back into memory.

In [261]:
df = pandas.read_json('../data/transcriptions.gz', compression='gzip')

As a reminder of what the dataset looks like, the first few rows are displayed below.

In [262]:
df.head()

Unnamed: 0,body,created,creator,generated,generator,id,motivation,partOf,tag,target,transcription,type
0,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-05-30T18:29:57Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Melo-Drama,Annotation
10,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-05-30T18:36:33Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Asiatic Melo-Dramatic Romance,Annotation
1009,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-03T21:42:23Z,,2018-06-04T09:43:57Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Comedy,Annotation
101,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-01T16:39:55Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,title,{u'source': u'https://api.bl.uk/metadata/iiif/...,Othello Travestie,Annotation
1019,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-03T21:42:23Z,,2018-06-04T09:43:57Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Comedy,Annotation
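
As a rough illustration of the loading step, the sketch below writes a tiny gzip-compressed JSON file and reads it back with `pandas.read_json`. It assumes the file holds a JSON array of records; the actual layout of `transcriptions.gz` is not shown in this notebook, and the two sample records and the temporary path are made up for the example.

```python
import gzip
import json
import os
import tempfile

import pandas

# Hypothetical sample records, loosely modelled on the columns shown above
records = [
    {'tag': 'genre', 'transcription': 'Comedy'},
    {'tag': 'title', 'transcription': 'Othello Travestie'},
]

# Write the records to a gzip-compressed JSON file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sample.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    json.dump(records, f)

# Load the compressed file straight into a dataframe
sample_df = pandas.read_json(path, compression='gzip')
print(sample_df)
```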


We will also need to load some additional metadata for each volume. Specifically, various fields in the MARC records require place data. Each volume can be identified by its IIIF manifest URI, which is shown in the table above under **partOf**. We could use this URI to download each manifest and parse it in an attempt to identify place data. However, due to possible variations in the way this place data is stored and differences in the ways we might need to represent it in the MARC records (e.g. including countries as well as cities), it is more reliable to store a separate map of manifest URIs against the additional metadata required.

Below, this additional metadata is loaded into another dataframe and indexed by manifest URI.

In [263]:
volume_metadata_df = pandas.read_csv('../metadata/volume.csv')
volume_metadata_df.set_index('manifest_uri', inplace=True, verify_integrity=True)
volume_metadata_df.head()

manifest_uri,system_number,title,theatre,city,country
https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022588857.0x000002/manifest.json,16661350,Covent Garden Theatre 1753-1779 (Vol. 1),Covent Garden Theatre,London,England
https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022588873.0x000002/manifest.json,16661350,Covent Garden Theatre 1753-1779 (Vol. 2),Covent Garden Theatre,London,England
https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022588789.0x000002/manifest.json,16661351,Covent Garden Theatre 1779-1781,Covent Garden Theatre,London,England
https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022588775.0x000002/manifest.json,16661352,Covent Garden Theatre 1781-1783,Covent Garden Theatre,London,England
https://api.bl.uk/metadata/iiif/ark:/81055/vdc_100022588781.0x000002/manifest.json,16661353,Covent Garden Theatre 1783-1785,Covent Garden Theatre,London,England
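
The lookup that this index makes possible can be sketched as follows. The two rows and their `example.org` URIs are hypothetical stand-ins for `../metadata/volume.csv`; only the column names are taken from the table above.

```python
import io

import pandas

# Hypothetical two-row stand-in for ../metadata/volume.csv
csv_text = u"""manifest_uri,system_number,title,theatre,city,country
https://example.org/iiif/vol-1/manifest.json,1,Example Theatre 1800-1810,Example Theatre,London,England
https://example.org/iiif/vol-2/manifest.json,2,Example Theatre 1810-1820,Example Theatre,London,England
"""

sample_md_df = pandas.read_csv(io.StringIO(csv_text))

# verify_integrity=True raises an error if a manifest URI appears twice
sample_md_df.set_index('manifest_uri', inplace=True, verify_integrity=True)

# Look up one volume by its manifest URI and convert the row to a dictionary
volume_md = sample_md_df.loc['https://example.org/iiif/vol-1/manifest.json'].to_dict()
print(volume_md['theatre'], volume_md['city'], volume_md['country'])
```

Indexing by manifest URI means each sheet's **partOf** value can be used directly as the lookup key.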


## Static fields

There are a number of fields that will remain the same for each row of the final CSV and any subsequently created MARC records. The following function returns a dictionary that will be used as a basis for each row.

In [264]:
def get_static_fields():
    return {
        'Aleph system number (001)': '',  # Will become a lookup after initial ingest
        'Country (008/15-17)': 'enk',
        'Form of item (008/23)': 's',
        'Language (008/35-37)': 'eng',
        'Place of manufacture (264 $a)': '[London]',
        'Manufacturer (264 $b)': '[British Library]',
        'Date of manufacture (264 $c)': '[{}]'.format(datetime.datetime.now().year),
        'Extent (300 $a)': '1 online resource',
        'Content type term (336 $a)': 'text',
        'Source (336 $2)': 'rdacontent',
        'Media type term (337 $a)': 'computer',
        'Source (337 $2)': 'rdamedia',
        'Carrier type term (338 $a)': 'online resource',
        'Source (338 $2)': 'rdacarrier',
        'Corporate name or jurisdiction name as entry element (710 $a)': 'British Library Playbills Project',
        'Relator term (710 $e)': 'manufacturer',
        'Link text (856 $y)': 'digitised sheet'
    }
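
One reason to build the static fields inside a function, rather than sharing a single dictionary, is that every call returns a fresh copy, so fields added to one row cannot leak into the next. The sketch below shows this with a hypothetical, cut-down `make_base_row` function; the title value is invented for the example.

```python
def make_base_row():
    # Cut-down, hypothetical version of the static fields
    return {
        'Country (008/15-17)': 'enk',
        'Language (008/35-37)': 'eng',
    }


row_a = make_base_row()
row_a.update({'Devised title (245 $a)': u'[Example Theatre playbill for Othello Travestie]'})

# A second call returns a new dictionary that is unaffected by the update above
row_b = make_base_row()
print(sorted(row_a.keys()))
print(sorted(row_b.keys()))
```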

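## Helper functions

The following helper functions are used when generating the fields below. The first returns all transcriptions in a group of annotations that carry a given tag. The second parses the performance date for a sheet, returning `None` when no date, or only a partial date, has been transcribed.
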
In [265]:
def get_transcriptions_by_tag(group_df, tag):
    transcriptions_df = group_df[group_df['motivation'] == 'describing']
    transcriptions_df = transcriptions_df[transcriptions_df['tag'] == tag]
    return transcriptions_df['transcription'].tolist()
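
The same filtering can be sketched on a small, made-up dataframe. Here the two conditions are combined into a single boolean mask, which is equivalent to filtering twice in a row; the `tagging` row is a hypothetical example of an annotation that should be excluded.

```python
import pandas

# Hypothetical annotations for a single sheet
sample_group_df = pandas.DataFrame([
    {'motivation': 'describing', 'tag': 'title', 'transcription': 'Othello Travestie'},
    {'motivation': 'describing', 'tag': 'genre', 'transcription': 'Comedy'},
    {'motivation': 'tagging', 'tag': 'title', 'transcription': None},
])

# Keep rows that are transcriptions ('describing') and carry the wanted tag
mask = (sample_group_df['motivation'] == 'describing') & (sample_group_df['tag'] == 'title')
titles = sample_group_df.loc[mask, 'transcription'].tolist()
print(titles)  # ['Othello Travestie']
```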

In [266]:
def get_performance_timestamp(group_df):
    transcriptions = get_transcriptions_by_tag(group_df, 'date')
    if not transcriptions:
        return None
    
    # We should only have one date for each sheet
    elif len(transcriptions) > 1:
        raise ValueError('Multiple dates found')
    
    # Skip until we determine how to handle partial dates
    elif len(transcriptions[0]) < 10:
        return None
    
    ts = dateutil.parser.parse(transcriptions[0], yearfirst=True)
    return ts
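
As a minimal sketch of the parsing step, assuming dates are transcribed in the form `YYYY-MM-DD` (the sample values here are invented), a full date is ten characters long and parses to a `datetime` object, while a shorter, partial date falls under the length check and is skipped.

```python
import dateutil.parser

full_date = '1846-03-02'   # hypothetical complete transcription
partial_date = '1846-03'   # hypothetical partial transcription

# A complete date passes the length check and can be parsed
assert len(full_date) >= 10
ts = dateutil.parser.parse(full_date, yearfirst=True)
print(ts.year, ts.month, ts.day)  # 1846 3 2

# A partial date is shorter than ten characters, so it would be skipped
print(len(partial_date) < 10)  # True
```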

## Title fields

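All titles transcribed from a sheet are joined with semicolons to build the devised title (245 $a), which is prefixed with the theatre name, and the title statement of the original (534 $t). Each title is also recorded individually as a numbered other title (246 $a). Sheets with no title transcriptions return an empty dictionary.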

In [267]:
def get_title_fields(group_df, theatre):
    transcriptions = get_transcriptions_by_tag(group_df, 'title')
    if not transcriptions:
        return {}
    
    joined_titles = u'; '.join(transcriptions)
    out = {
        'Devised title (245 $a)': u'[{0} playbill for {1}]'.format(theatre, joined_titles),
        'Title statement of original (534 $t)': u'[Playbill for {0}]'.format(joined_titles)
    }
    
    for i, value in enumerate(transcriptions):
        out['Other title - {} (246 $a)'.format(i + 1)] = value

    return out
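
To make the shape of the output concrete, the sketch below builds the same fields from a hard-coded list of titles. The pairing of these two titles with this theatre is illustrative only and is not taken from a real sheet.

```python
# Illustrative inputs; not a real combination from the dataset
theatre = u'Covent Garden Theatre'
titles = [u'Othello Travestie', u'The Lancers']

joined_titles = u'; '.join(titles)
title_fields = {
    'Devised title (245 $a)': u'[{0} playbill for {1}]'.format(theatre, joined_titles),
    'Title statement of original (534 $t)': u'[Playbill for {0}]'.format(joined_titles),
}

# One numbered column per individual title
for i, value in enumerate(titles):
    title_fields['Other title - {} (246 $a)'.format(i + 1)] = value

for key in sorted(title_fields):
    print(key, '=>', title_fields[key])
```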

## Date fields

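The performance date is recorded in the 033 $a field, which expects a date in the form `yyyymmdd`. Sheets without a complete date return an empty dictionary.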

In [268]:
def get_date_fields(group_df):
    ts = get_performance_timestamp(group_df)
    if not ts:
        return {}
    
    return {
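        # Format as yyyymmdd; adding the integer parts together would give a meaningless sum
        'Date/Time of an Event (033 $a)': '{:04d}{:02d}{:02d}'.format(ts.year, ts.month, ts.day)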
    }
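
When building the 033 $a value, the year, month and day need to be formatted as zero-padded text rather than combined as numbers. A short sketch with an invented date shows the difference.

```python
import datetime

ts = datetime.datetime(1846, 3, 2)  # hypothetical performance date

# Adding the integer components produces a single meaningless number
summed = ts.year + ts.month + ts.day
print(summed)  # 1851

# Zero-padded string formatting produces the yyyymmdd form
formatted = '{:04d}{:02d}{:02d}'.format(ts.year, ts.month, ts.day)
print(formatted)  # 18460302
```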

## Genre fields

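Each genre transcription becomes a numbered subject entry (650 $a), subdivided geographically by the country and city of the theatre (650 $z) and, where a performance date is available, chronologically by the year of performance (650 $y).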

In [269]:
def get_genre_fields(group_df, country, city):
    transcriptions = get_transcriptions_by_tag(group_df, 'genre')
    if not transcriptions:
        return {}
    
    ts = get_performance_timestamp(group_df)
    out = {}
    for i, value in enumerate(transcriptions):
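        # Country and city need distinct column names; with identical keys
        # the second entry would silently overwrite the first
        out.update({
            'Topical term or geographic name entry element - {} (650 $a)'.format(i + 1): value,
            'Geographic subdivision (country) - {} (650 $z)'.format(i + 1): country,
            'Geographic subdivision (city) - {} (650 $z)'.format(i + 1): city
        })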
        
        if ts:
            out['Chronological subdivision - {} (650 $y)'.format(i + 1)] = ts.year
    return out
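
Because a Python dictionary keeps only the last value assigned to a given key, the country and city subdivisions must be stored under different column names if both are to reach the CSV. The sketch below demonstrates the behaviour; the column names with `(country)` and `(city)` in them are one possible naming scheme, not an established convention.

```python
# Two entries with the same key: only the last one survives
same_key = {}
same_key['Geographic subdivision - 1 (650 $z)'] = 'England'
same_key['Geographic subdivision - 1 (650 $z)'] = 'London'
print(same_key)

# Distinct (hypothetical) column names keep both values
distinct_keys = {
    'Geographic subdivision (country) - 1 (650 $z)': 'England',
    'Geographic subdivision (city) - 1 (650 $z)': 'London',
}
print(sorted(distinct_keys.values()))  # ['England', 'London']
```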

## Normalising the transcription targets

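The targets of our transcription annotations can be stored in two different ways. When an annotation refers to a fragment of a sheet, the target is a dictionary whose `source` key identifies the sheet. When an annotation refers to the entire sheet, the target is simply the URI of that sheet. We define a function to normalise the two forms so that we always get the source URI.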

In [270]:
def get_source(target):
    if isinstance(target, dict):
        return target['source']
    return target

We then apply this function to each target and store the result in a new **source** column.

In [271]:
df['source'] = df['target'].apply(get_source)
df.head()

Unnamed: 0,body,created,creator,generated,generator,id,motivation,partOf,tag,target,transcription,type,source
0,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-05-30T18:29:57Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Melo-Drama,Annotation,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
10,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-05-30T18:36:33Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Asiatic Melo-Dramatic Romance,Annotation,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
1009,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-03T21:42:23Z,,2018-06-04T09:43:57Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Comedy,Annotation,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
101,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-01T16:39:55Z,,2018-06-04T09:43:56Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,title,{u'source': u'https://api.bl.uk/metadata/iiif/...,Othello Travestie,Annotation,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
1019,"[{u'type': u'TextualBody', u'purpose': u'descr...",2018-06-03T21:42:23Z,,2018-06-04T09:43:57Z,"[{u'homepage': u'https://www.libcrowds.com', u...",https://annotations.libcrowds.com/annotations/...,describing,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,genre,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,Comedy,Annotation,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
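
To illustrate the normalisation on both kinds of target, the sketch below repeats the function and applies it to two made-up values. The `example.org` URI and the `selector` structure are assumptions for the example; only the `source` key is confirmed by the code above.

```python
def get_source(target):
    # Fragment targets are dictionaries; whole-sheet targets are plain URIs
    if isinstance(target, dict):
        return target['source']
    return target


# Hypothetical fragment target: a region of a sheet
fragment_target = {
    'source': 'https://example.org/iiif/canvas/1',
    'selector': {'type': 'FragmentSelector', 'value': 'xywh=10,20,300,40'},
}

# Hypothetical whole-sheet target
sheet_target = 'https://example.org/iiif/canvas/1'

# Both forms resolve to the same sheet URI
print(get_source(fragment_target) == get_source(sheet_target))  # True
```

With every annotation reduced to a sheet URI, annotations on the same sheet can be grouped together regardless of how their targets were stored.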


## Generating the final CSV

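We can now generate our final CSV. The annotations are grouped by sheet using the normalised **source** column. For each sheet we look up the volume metadata, build the title, date and genre fields, and merge them with the static fields. Sheets without a title or a date are skipped for now.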

In [272]:
grouped = df.groupby('source', as_index=False)

out_data = []
for source, group_df in grouped:
    
    # Get volume metadata
    manifest_uri = group_df['partOf'].tolist()[0]
    volume_md = volume_metadata_df.loc[manifest_uri].to_dict()
    
    date_fields = get_date_fields(group_df)
    title_fields = get_title_fields(group_df, volume_md['theatre'])
    genre_fields = get_genre_fields(group_df, volume_md['country'], volume_md['city'])
    
    # Skip rows without titles or dates for now
    if not (title_fields and date_fields):
        continue
    
    row = get_static_fields()
    row.update(title_fields)
    row.update(date_fields)
    row.update(genre_fields)
    out_data.append(row)

out_df = pandas.DataFrame(out_data)
out_df.head()

Unnamed: 0,Aleph system number (001),Carrier type term (338 $a),Chronological subdivision - 1 (650 $y),Chronological subdivision - 2 (650 $y),Chronological subdivision - 3 (650 $y),Chronological subdivision - 4 (650 $y),Chronological subdivision - 5 (650 $y),Chronological subdivision - 6 (650 $y),Chronological subdivision - 7 (650 $y),Content type term (336 $a),...,Source (337 $2),Source (338 $2),Title statement of original (534 $t),Topical term or geographic name entry element - 1 (650 $a),Topical term or geographic name entry element - 2 (650 $a),Topical term or geographic name entry element - 3 (650 $a),Topical term or geographic name entry element - 4 (650 $a),Topical term or geographic name entry element - 5 (650 $a),Topical term or geographic name entry element - 6 (650 $a),Topical term or geographic name entry element - 7 (650 $a)
0,,online resource,1846.0,1846.0,1846.0,1846.0,,,,text,...,rdamedia,rdacarrier,[Playbill for Oliver Twist; Grand Incidental B...,Burletta,Ballet,Melo-Drama,Melo-Dramatic Spectacle,,,
1,,online resource,1821.0,1821.0,1821.0,,,,,text,...,rdamedia,rdacarrier,"[Playbill for Lowina of Tobolski, Or, the Fata...",Melodrama,Melodrama,Melodrama,,,,
2,,online resource,1846.0,1846.0,1846.0,,,,,text,...,rdamedia,rdacarrier,[Playbill for The Lancers; Romeo and Juliet; O...,Melodrama,Interlude,Tragedy,,,,
3,,online resource,1846.0,1846.0,1846.0,,,,,text,...,rdamedia,rdacarrier,[Playbill for The Pledge; Robert Macaire; Char...,Melodrama,Historical Drama,Drama,,,,
4,,online resource,1847.0,,,,,,,text,...,rdamedia,rdacarrier,[Playbill for A Concert],Concert,,,,,,
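
The empty cells and the years shown as `1846.0` in the table above follow from how pandas builds a dataframe from a list of dictionaries with differing keys: missing values are filled with `NaN`, and an integer column containing `NaN` is stored as floating point. A small sketch with invented rows:

```python
import pandas

first_col = 'Chronological subdivision - 1 (650 $y)'
second_col = 'Chronological subdivision - 2 (650 $y)'

# Hypothetical rows: the second sheet has only one genre
rows = [
    {first_col: 1846, second_col: 1846},
    {first_col: 1847},
]
sample_out_df = pandas.DataFrame(rows)

# The first column is complete, so it stays as integers
print(sample_out_df[first_col].dtype)

# The second column has a gap, so it is stored as floats with NaN
print(sample_out_df[second_col].dtype)
print(sample_out_df[second_col].tolist())
```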


In [273]:
out_df.to_csv('../data/marc.csv', encoding='utf-8')
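
By default, `to_csv` also writes the dataframe index as an unnamed first column. Whether that column is wanted in the MARC template is a decision for the ingest process; the sketch below, using a made-up one-row dataframe, simply shows the difference that `index=False` makes to the header line.

```python
import pandas

# Hypothetical one-row dataframe with two of the static columns
sample_marc_df = pandas.DataFrame([
    {'Language (008/35-37)': 'eng', 'Country (008/15-17)': 'enk'},
])

# With no path given, to_csv returns the CSV text as a string
with_index = sample_marc_df.to_csv()
without_index = sample_marc_df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Language (008/35-37),Country (008/15-17)
print(without_index.splitlines()[0])  # Language (008/35-37),Country (008/15-17)
```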