## Building a Database of Performances

This step is likely to get a bit more complex, as it combines the title and date transcriptions into one record per performance.

In [50]:
import pandas

In [51]:
df = pandas.read_json('../data/transcriptions.gz', compression='gzip')

In [52]:
def get_source(target):
    if isinstance(target, dict):
        return target['source']
    else:
        return target
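
As a quick illustration of what `get_source` does, here is a minimal sketch. It assumes a `target` is either a dict with a `source` key (as the table output further down suggests) or presumably a plain URL string. The sample values are hypothetical and not taken from the real data.

```python
def get_source(target):
    """Return the 'source' entry of a dict target, or the target itself."""
    if isinstance(target, dict):
        return target['source']
    return target

# Hypothetical targets, made up for this example.
dict_target = {'source': 'https://example.org/manifest/1',
               'selector': 'xywh=0,0,10,10'}
plain_target = 'https://example.org/manifest/2'

from_dict = get_source(dict_target)    # unwraps the dict
from_plain = get_source(plain_target)  # passed through unchanged
print(from_dict)
print(from_plain)
```

Either way the result is a single value that can be used as a merge key.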

Remove the columns we are no longer interested in.

In [53]:
df = df[['target', 'tag', 'transcription']]

Start with the titles, as these will be our root element.

In [54]:
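def get_tag_df(full_df, tag):
    """Return the rows for one tag, with 'transcription' renamed to the tag."""
    # Filter on the DataFrame passed in, not the global df.
    tag_df = full_df[full_df['tag'] == tag]
    tag_df = tag_df.rename(columns={'transcription': tag})
    tag_df = tag_df.drop('tag', axis=1)
    return tag_df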
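
To see the effect of the tag filter in isolation, the sketch below runs the same steps on a tiny, made-up DataFrame. The sample rows are hypothetical; only the column names (`target`, `tag`, `transcription`) come from the notebook.

```python
import pandas

def get_tag_df(full_df, tag):
    # Keep only rows with the requested tag.
    tag_df = full_df[full_df['tag'] == tag]
    # Name the transcription column after the tag, then drop the tag column.
    tag_df = tag_df.rename(columns={'transcription': tag})
    return tag_df.drop('tag', axis=1)

# Hypothetical sample data.
sample = pandas.DataFrame({
    'target': ['a', 'a', 'b'],
    'tag': ['title', 'date', 'title'],
    'transcription': ['The Slave', '1828-03-21', 'Victorine'],
})

titles = get_tag_df(sample, 'title')
print(titles)
```

The result keeps the original index of the matching rows, which is why the real output shows index values such as 10025 rather than starting from 0.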

In [55]:
plays_df = get_tag_df(df, 'title')
plays_df.head()

Unnamed: 0,target,title
10025,{u'source': u'https://api.bl.uk/metadata/iiif/...,The Slave
10026,{u'source': u'https://api.bl.uk/metadata/iiif/...,Black-Eyed Susan
10028,{u'source': u'https://api.bl.uk/metadata/iiif/...,A Day After the Fair
10029,{u'source': u'https://api.bl.uk/metadata/iiif/...,Victorine
10030,{u'source': u'https://api.bl.uk/metadata/iiif/...,Twas I


In [56]:
plays_df['source'] = plays_df['target'].apply(get_source)

In [57]:
dates_df = get_tag_df(df, 'date')
dates_df['source'] = dates_df['target'].apply(get_source)

In [58]:
plays_df = plays_df.merge(dates_df, on='source', how='left')

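Both DataFrames have a `target` column, so the merge suffixed them as `target_x` (from the titles) and `target_y` (from the dates). Rename `target_x` to `title_target`.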

In [59]:
plays_df = plays_df.rename(columns={'target_x': 'title_target'})
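
The `target_x` name comes from how pandas handles column names that appear in both DataFrames but are not merge keys. The sketch below shows this on hypothetical stand-ins for `plays_df` and `dates_df`; the values are invented for illustration.

```python
import pandas

# Hypothetical stand-ins for the title and date DataFrames.
titles = pandas.DataFrame({
    'target': ['t1', 't2'],
    'title': ['The Slave', 'Victorine'],
    'source': ['s1', 's2'],
})
dates = pandas.DataFrame({
    'target': ['d1'],
    'date': ['1828-03-21'],
    'source': ['s1'],
})

# A left merge keeps every title; titles with no matching date get NaN.
merged = titles.merge(dates, on='source', how='left')
print(list(merged.columns))
print(merged['date'].isna().tolist())
```

Because `how='left'` keeps titles that have no date, the later `dropna(subset=['date'])` step is what actually removes them.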

Get the columns we are now interested in.

In [60]:
plays_df = plays_df[['title', 'date', 'source', 'title_target']]

Drop any performances without a date. We can come back and run this again later when we have that data.

In [61]:
plays_df = plays_df.dropna(subset=['date'])
plays_df.head()

Unnamed: 0,title,date,source,title_target
0,The Slave,1828-03-21,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...
1,Black-Eyed Susan,1829-12-03,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...
2,A Day After the Fair,1829-12-03,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...
3,Victorine,1833-12-31,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...
4,Twas I,1826-11-28,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...


Also, for now, drop any performances with a partial date, meaning anything that does not contain a full `YYYY-MM-DD` date. Again, we'll come back and get them later, once the decision has been made about how to handle them.

In [62]:
plays_df = plays_df[plays_df['date'].str.contains(r'\d{4}-\d{2}-\d{2}')]
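
As a rough illustration of this filter, the sketch below applies the same pattern to a few hypothetical date transcriptions. The partial formats shown (`YYYY-MM` and `YYYY`) are assumptions about what partial dates might look like.

```python
import pandas

# Hypothetical date transcriptions, including assumed partial forms.
dates = pandas.Series(['1828-03-21', '1829-12', '1833', '1826-11-28'])

# Raw string so the backslashes reach the regex engine unchanged.
full_date = r'\d{4}-\d{2}-\d{2}'

is_full = dates.str.contains(full_date)
full_dates = dates[is_full]
print(full_dates.tolist())
```

Note that `str.contains` matches the pattern anywhere in the string, so a value with extra text around a full date would also pass. If stricter matching were wanted, `str.fullmatch` with the same pattern is one option.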

In [63]:
plays_df.to_json('../data/performance_titles_dates.gz', compression='gzip')
plays_df.describe()

Unnamed: 0,title,date,source,title_target
count,768,768,768,768
unique,511,346,354,749
top,She Stoops to Conquer,1840-03-07,https://api.bl.uk/metadata/iiif/ark:/81055/vdc...,{u'source': u'https://api.bl.uk/metadata/iiif/...
freq,9,13,13,2


pybossa_tasks_df = pandas.read_pickle('../data/pybossa_tasks.pkl')