# Loading Farms to Freeways from the API and ro-crate metadata file

The Language Data Commons of Australia (LDaCa) packages all of its data collections as [ro-crates](https://www.researchobject.org/ro-crate/). Every data collection comes with a metadata file called `ro-crate-metadata.json`, and this file is how we obtain metadata about the research objects in the collection.

The metadata file is in the json format, and so we'll be learning how to read a json file in this notebook.
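
As a minimal sketch of what reading json looks like, the snippet below parses a small, made-up fragment shaped like an ro-crate metadata file. The `@graph`, `@id` and `@type` keys mirror the ones used later in this notebook; the entity itself is invented for illustration and is not taken from the collection.

```python
import json

# A tiny, invented json document shaped like an ro-crate metadata file
sample_text = '{"@graph": [{"@id": "#example", "@type": ["Person", "RepositoryObject"], "name": "Example Person"}]}'

# json.loads turns json text into Python dictionaries and lists
sample = json.loads(sample_text)

first_entity = sample["@graph"][0]
print(first_entity["name"])    # Example Person
print(first_entity["@type"])   # ['Person', 'RepositoryObject']
```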

<div class="alert alert-block alert-success">
<b>Skills</b> 
    
<ul>
<li> json file format (see https://en.wikipedia.org/wiki/JSON)</li>
<li> working with dataframes, via pandas</li>
<li> discovering and exploring metadata</li>
<li> extracting ngrams, via textacy</li>
</ul>    
<br>

<b>Skill level:</b> Intermediate
</div>

This notebook uses the library 'requests', as shown in the [Using APIs: Open Australia](https://github.com/Australian-Text-Analytics-Platform/open-australia-api/blob/main/api.ipynb) notebook. If you haven't already familiarised yourself with that notebook, it might be a good idea to do so first.

In [3]:
# Before we begin, let's make sure that we install all the requirements that we need
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting en_core_web_sm
  Using cached en_core_web_sm-3.0.0-py3-none-any.whl
Collecting matplotlib==3.4.3
  Using cached matplotlib-3.4.3-cp39-cp39-manylinux2014_aarch64.whl (9.0 MB)
Collecting requests==2.26
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting pandas==1.3.4
  Using cached pandas-1.3.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (10.9 MB)
Collecting spacy<4.0.0,>=3.0.0
  Using cached spacy-3.2.1.tar.gz (1.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting textacy==0.11.0
  Using cached textacy-0.11.0-py3-none-any.whl (200 kB)
Collecting python-dotenv>=0.19.2
  Using cached python_dotenv-0.19.2-py2.py3-none-any.whl (17 kB)
Collecting numpy>=1.16
  Using cached numpy-1.21.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (13.0 MB)
Collec

## Import libraries

Python needs the libraries that will be used by the notebook to be specified before they are used. We do this with the reserved word `import`, as shown below.

In [7]:
import json                       # json library to read json file formats
import pprint                     # Prints in a nice way
import requests                   # Uses the requests library for REST apis
import os                         # Loads operating system libraries

## Variables 

We need to specify the path to the data collection. This `Farms to Freeways` data collection is used with permission from [The University of Western Sydney](https://omeka.uws.edu.au/farmstofreeways/). It was made into an ro-crate by the [LDaCa](https://ardc.edu.au/project/language-data-commons-of-australia-ldaca/) project, and it is the data set used here to demonstrate the skills listed above.

The variables below refer to the `path` where the collection can be found. Other variables below refer to conventions specified in the LDaCa ro-crate profiles: for example, every artefact of importance is typed as a `RepositoryObject`, and when an artefact links to a file, it does so with the `hasFile` keyword in the ro-crate metadata file.
<div class="alert alert-block alert-danger">
Create a file named vars.env and store your API key in it under the name API_KEY. The key is required for downloading any files. Go to the <a href='https://oni-demo.text-commons.org'>LDaCa website</a> to generate an API key.

Example vars.env:

API_KEY=12345

</div>

In [8]:
# Specify location where collection is
LDACA_API = 'https://oni-demo.text-commons.org/api/data'
COLLECTION_ID = 'arcp://name,farms-to-freeways/root/description'
from dotenv import load_dotenv    # loads environment variables
load_dotenv('vars.env') # load the environment variables located in the vars.env files
API_TOKEN = os.getenv('API_KEY') # store your environment variable in this jupyter notebook
if not API_TOKEN:
    print("Set a variable named API_KEY in the vars.env file")

In [59]:
# Get the Farms to Freeways ro-crate metadata by passing the arcpId in as a parameter to the get request
params = dict()
params['id'] = COLLECTION_ID

f2f_response = requests.get(LDACA_API, params=params)
f2f_response.raise_for_status()  # stop early if the API returned an HTTP error
metadata = f2f_response.json()

# Inspect the metadata (Currently commented out for brevity)
# metadata 
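
To see what the `params` dictionary contributes to the request, the sketch below builds an equivalent query string with the standard library's `urllib.parse`. It is illustrative only: `requests` does this encoding for you, and the exact escaping it applies may differ in detail.

```python
from urllib.parse import urlencode, parse_qs

LDACA_API = 'https://oni-demo.text-commons.org/api/data'
COLLECTION_ID = 'arcp://name,farms-to-freeways/root/description'

# Encode the parameters the way a GET request carries them in its URL
query = urlencode({'id': COLLECTION_ID})
url = LDACA_API + '?' + query
print(url)

# Decoding the query string recovers the original id
decoded = parse_qs(query)
print(decoded['id'][0])
```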

### ro-crate Profiles

An ro-crate profile is a set of conventions that tell us what elements an ro-crate minimally contains.

These profiles tell us what to expect to find in the data packages. Learn more about them here: https://www.researchobject.org/ro-crate/profiles.html

In [10]:
# Keywords from LDaCa ro-crate profiles
OBJECT_LINKAGE = 'hasFile'
GRAPH = '@graph'
TYPE = '@type'
ID = '@id'

# TYPE values are lists. 
# We define a PRIMARY_OBJECT as a 'RepositoryObject' because that is where the main data is stored 
PRIMARY_OBJECT = 'RepositoryObject'

### Define Variables to Gather Metadata

As suggested above, there are '@types' that define certain objects within the collection. For example, there is a type called 'Person', and a json object of this type stores information about the person, such as their 'birthDate'. In the code block below, we discover all the types stored in this metadata file.

In [11]:
# Find all types and find types that have linked objects
linked_objects = set()
types = list()
primary_object_types = set()

# Traverse through all the objects in the metadata file
for entity in metadata[GRAPH]:
    my_type = entity[TYPE]
    if isinstance(my_type, str):
        my_type = [my_type]
    if my_type not in types:
        types.append(my_type)
        
    # [PRIMARY_OBJECT, X] : primary_object_type = X
    if PRIMARY_OBJECT in my_type:
        primary_object_type = [e for e in my_type if e not in [PRIMARY_OBJECT]][0]
        primary_object_types.add(primary_object_type)

        if OBJECT_LINKAGE in entity:
            for x in entity[OBJECT_LINKAGE]:
                filename = x[ID]
                suffix = filename.split('.')[-1]
                # a set ignores duplicates, so no membership check is needed
                linked_objects.add((suffix, primary_object_type))
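
The loop above can be tried out on a small hand-made graph. The three entities below are invented for illustration (they are not from Farms to Freeways); they show how a string-valued `@type` is wrapped in a list and how file suffixes are paired with the primary object type.

```python
# Invented entities that follow the same conventions as the real metadata
toy_graph = [
    {'@id': '#p1', '@type': ['Person', 'RepositoryObject']},
    {'@id': '#i1', '@type': ['RepositoryObject', 'TextDialogue'],
     'hasFile': [{'@id': 'files/1/a.csv'}, {'@id': 'files/1/a.mp3'}]},
    {'@id': '#place', '@type': 'Place'},   # a single type stored as a string
]

toy_types = []
toy_primary = set()
toy_linked = set()

for entity in toy_graph:
    entity_type = entity['@type']
    if isinstance(entity_type, str):
        entity_type = [entity_type]
    if entity_type not in toy_types:
        toy_types.append(entity_type)
    if 'RepositoryObject' in entity_type:
        # the other type in the pair says what kind of object this is
        kind = [t for t in entity_type if t != 'RepositoryObject'][0]
        toy_primary.add(kind)
        for linked in entity.get('hasFile', []):
            suffix = linked['@id'].split('.')[-1]
            toy_linked.add((suffix, kind))

print(toy_types)
print(sorted(toy_primary))   # ['Person', 'TextDialogue']
print(sorted(toy_linked))    # [('csv', 'TextDialogue'), ('mp3', 'TextDialogue')]
```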

### Flatten a list of lists using itertools

The variable 'types' above is a list of lists and we will flatten it with itertools

<div class="alert alert-block alert-info">
<b>Python Library: itertools</b> 
    
The itertools library lets you iterate over nested lists without writing a list comprehension or explicit "for loops" in your code. (A) and (B) below are each equivalent to the itertools code that follows.

<br> 
    
   
(A)&emsp; flat_types = [t for sublist in types for t in sublist] 

<br>  

(B) &emsp;flat_types = list()
<br>
&emsp;&emsp;&emsp;for sublist in types: 
<br>
&emsp;&emsp; &emsp;&emsp;   for t in sublist:
<br>
&emsp;&emsp;&emsp; &emsp;&emsp;       flat_types.append(t) <br>
</div>

The line of code below performs this transformation:
    [[a, b], [c, d]] ==> [a, b, c, d]

In [12]:
import itertools     # `types` above is a list of lists and we will flatten it with itertools

flat_types = list(itertools.chain.from_iterable(types))
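
As a quick check on a toy list (the letters are placeholders), the itertools call and the list comprehension give the same flattened result:

```python
import itertools

nested = [['a', 'b'], ['c', 'd']]

with_itertools = list(itertools.chain.from_iterable(nested))
with_comprehension = [t for sublist in nested for t in sublist]

print(with_itertools)                          # ['a', 'b', 'c', 'd']
print(with_itertools == with_comprehension)    # True
```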

## Exploring the Metadata

Anytime you work with data, it's always a good idea to inspect it by printing it out.

In [13]:
# Print the variables
# All the types
pprint.pp(sorted(types))

[['ContactPoint'],
 ['Corpus', 'Dataset', 'RepositoryCollection'],
 ['CreateAction'],
 ['CreativeWork'],
 ['Dataset'],
 ['File'],
 ['File', 'OrthographicTranscription'],
 ['GeoCoordinates'],
 ['Organization'],
 ['Person'],
 ['Person', 'RepositoryObject'],
 ['Photographic image', 'RepositoryObject'],
 ['Place'],
 ['PropertyValue'],
 ['RepositoryCollection'],
 ['RepositoryObject', 'Text'],
 ['RepositoryObject', 'TextDialogue'],
 ['SoftwareSourceCode'],
 ['SubCorpus'],
 ['csvw:Column'],
 ['csvw:Schema'],
 ['rdf:Property'],
 ['rdfs:Class']]


In [14]:
# All the unique types
pprint.pp(set(flat_types))

{'ContactPoint',
 'Corpus',
 'CreateAction',
 'CreativeWork',
 'Dataset',
 'File',
 'GeoCoordinates',
 'Organization',
 'OrthographicTranscription',
 'Person',
 'Photographic image',
 'Place',
 'PropertyValue',
 'RepositoryCollection',
 'RepositoryObject',
 'SoftwareSourceCode',
 'SubCorpus',
 'Text',
 'TextDialogue',
 'csvw:Column',
 'csvw:Schema',
 'rdf:Property',
 'rdfs:Class'}


In [15]:
# Types of PRIMARY_OBJECTs ie [PRIMARY_OBJECT, X]. What kinds of Xs do we have?
print(primary_object_types)

{'Text', 'TextDialogue', 'Photographic image', 'Person'}


In [16]:
# All the research artefacts/files and their types
pprint.pp(linked_objects)

{('csv', 'TextDialogue'),
 ('jpg', 'Photographic image'),
 ('jpg', 'Text'),
 ('jpg', 'TextDialogue'),
 ('mp3', 'TextDialogue'),
 ('pdf', 'Text'),
 ('pdf', 'TextDialogue')}


## Primary Objects

The primary object types are the ones we may care about, so we will pull them into their own dataframe:

<div class="alert alert-block alert-info">
<b>Python Library: pandas (dataframe)</b> 

<br>    
    
A dataframe is akin to a table -- it is made up of rows and columns. 
    
<br>    
In the block of code below, we are creating a dataframe for each "primary_object_type" ('Person', 'TextDialogue', 'Photographic image', and 'Text')
</div>

In [17]:
import pandas as pd  # this means we will refer to pandas as 'pd' throughout the code

all_data = dict()

# traverse over all of the metadata
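for entity in metadata[GRAPH]:
    entity_types = entity[TYPE]
    # a single type may be stored as a plain string, so wrap it in a list
    if isinstance(entity_types, str):
        entity_types = [entity_types]
    for t in entity_types:
        if t in primary_object_types:
            df = pd.json_normalize(entity)
            if t not in all_data:
                all_data[t] = df
            else:
                all_data[t] = pd.concat([all_data[t], df])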
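
`pd.json_normalize` flattens nested dictionaries into dotted column names, which is where columns such as `speaker.@id` in the output further down come from. The hand-written function below is a rough sketch of that flattening for a single invented entity; `flatten` is a hypothetical helper, not part of pandas, and it ignores the options that `json_normalize` supports.

```python
def flatten(entity, prefix=''):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in entity.items():
        full_key = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=full_key + '.'))
        else:
            # lists and plain values are kept as they are
            flat[full_key] = value
    return flat

# An invented entity with one nested dictionary
entity = {
    '@id': '#interview-example',
    '@type': ['RepositoryObject', 'TextDialogue'],
    'speaker': {'@id': '#speaker-example'},
}

flat_entity = flatten(entity)
print(sorted(flat_entity))          # ['@id', '@type', 'speaker.@id']
print(flat_entity['speaker.@id'])   # #speaker-example
```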

In [18]:
print(all_data.keys())

dict_keys(['Text', 'Photographic image', 'Person', 'TextDialogue'])


In [19]:
# Print the first TextDialogue item 
text_dialogue = all_data['TextDialogue']

print(text_dialogue.iloc[0])

@id                                                      #interview-#427
@type                                   [RepositoryObject, TextDialogue]
name                                     Interview with Patricia Colless
hasFile                [{'@id': 'https://oni-demo.text-commons.org/ap...
dateCreated                                                   1991-09-12
interviewer                                             Robyn Arrowsmith
publisher                                   University of Western Sydney
description            Patricia Colless was born on 1st April, 1922, ...
speaker.@id                                                         #568
contentLocation.@id    http://omeka.uws.edu.au/farmstofreeways/api/ge...
Name: 0, dtype: object


In [20]:
# Print the primary object Person with @id #568
person = all_data['Person']

person.loc[person['@id'] == '#568']

Unnamed: 0,@id,@type,birthDate,identifier,name,description,birthPlace,address,isPrimaryTopicOf.@id,authorOf.@id,relatedLink.@id
0,#568,"[Person, RepositoryObject]",1922,4 _Person,Patricia Colless,"Patricia Colless was born on 1st April, 1922, ...",Croydon,Penrith,#339,,


In [21]:
new = pd.merge(left=all_data['TextDialogue'], right=all_data['Person'], left_on="speaker.@id", right_on="@id",
               suffixes=('_artefact', '_speaker'), how='inner')

# Print the new dataframe
new

Unnamed: 0,@id_artefact,@type_artefact,name_artefact,hasFile,dateCreated,interviewer,publisher,description_artefact,speaker.@id,contentLocation.@id,...,@type_speaker,birthDate,identifier,name_speaker,description_speaker,birthPlace,address,isPrimaryTopicOf.@id,authorOf.@id,relatedLink.@id
0,#interview-#427,"[RepositoryObject, TextDialogue]",Interview with Patricia Colless,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-09-12,Robyn Arrowsmith,University of Western Sydney,"Patricia Colless was born on 1st April, 1922, ...",#568,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1922,4 _Person,Patricia Colless,"Patricia Colless was born on 1st April, 1922, ...",Croydon,Penrith,#339,,
1,#interview-#428,"[RepositoryObject, TextDialogue]",Interview with Rita Camilleri,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-10-18,Robyn Arrowsmith,University of Western Sydney,Rita Camilleri was born in Malta and arrived i...,#569,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",c 1924,11 _Person,Rita Camilleri,Rita Camilleri was born in Malta and arrived i...,Malta,Pendle Hill,#428,,
2,#interview-#430,"[RepositoryObject, TextDialogue]",Interview with Judith Eastwell,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-03-05,Robyn Arrowsmith,University of Western Sydney,"Judith Eastwell was born on 14th September, 19...",#571,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1945,33 _Person,Judith Eastwell,"Judith Eastwell was born on 14th September, 19...",Not stated,Quakers Hill,#343,,
3,#interview-#431,"[RepositoryObject, TextDialogue]",Interview with Florence Gibbons,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-10-28,Robyn Arrowsmith,University of Western Sydney,"Florence Gibbons was born on 11th September, 1...",#572,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1908,14 _Person,Florence Gibbons,"Florence Gibbons was born on 11th September, 1...",Regentville,Penrith,#345,,
4,#interview-#432,"[RepositoryObject, TextDialogue]",Interview with Iris Hanna,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-12-11,Robyn Arrowsmith,University of Western Sydney,"Iris Hanna was born on 30th November, 1914, at...",#573,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1914,24 _Person,Iris Hanna,"Iris Hanna was born on 30th November, 1914, at...",Lismore,Plumpton,#348,,
5,#interview-#433,"[RepositoryObject, TextDialogue]",Interview with Betty Hargreaves,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-10-17,Robyn Arrowsmith,University of Western Sydney,"Betty Hargreaves was born on 2nd January, 1918...",#574,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1918,10 _Person,Betty Hargreaves,"Betty Hargreaves was born on 2nd January, 1918...",Neutral Bay,Penrith,#351,,
6,#interview-#434,"[RepositoryObject, TextDialogue]",Interview with Marjorie Heath,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-08-28,Robyn Arrowsmith,University of Western Sydney,"Marjorie Heath was born on 28th January, 1926....",#575,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1926,1 _Person,Marjorie Heath,"Marjorie Heath was born on 28th January, 1926....",Not stated,Blacktown,#353,,
7,#interview-#435,"[RepositoryObject, TextDialogue]",Interview with Amy Jackson,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-12-05,Robyn Arrowsmith,University of Western Sydney,"Amy Jackson was born on 18th July, 1916, at Pe...",#576,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1916,22 _Person,Amy Jackson,"Amy Jackson was born on 18th July, 1916, at Pe...",Penrith,Penrith,#355,,
8,#interview-#436,"[RepositoryObject, TextDialogue]",Interview with Joan Jeffery,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-03-10,Robyn Arrowsmith,University of Western Sydney,"Joan Jeffrey was born on 21st June, 1926, at M...",#577,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1926,36 _Person,Joan Jeffery,"Joan Jeffrey was born on 21st June, 1926, at M...",Mudgee,Blacktown,#436,,
9,#interview-#437,"[RepositoryObject, TextDialogue]",Interview with Mavis Lamrock,[{'@id': 'https://oni-demo.text-commons.org/ap...,1991-10-08,Robyn Arrowsmith,University of Western Sydney,"Mavis Lamrock was born on 11th January, 1913, ...",#578,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1913,8 _Person,Mavis Lamrock,"Mavis Lamrock was born on 11th January, 1913, ...",Wauchope,Emu Plains,#394,#462,
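
An inner merge keeps only the rows whose key appears on both sides. As a plain-Python sketch of the idea (not how pandas implements it), the snippet below joins two invented lists of records by matching `speaker.@id` against `@id`.

```python
# Invented records standing in for the TextDialogue and Person dataframes
interviews = [
    {'@id': '#int-1', 'name': 'Interview one', 'speaker.@id': '#s1'},
    {'@id': '#int-2', 'name': 'Interview two', 'speaker.@id': '#s2'},
    {'@id': '#int-3', 'name': 'Interview three', 'speaker.@id': '#missing'},
]
people = [
    {'@id': '#s1', 'name': 'Speaker one'},
    {'@id': '#s2', 'name': 'Speaker two'},
]

# Index the people by their @id so each lookup is a dictionary access
people_by_id = {p['@id']: p for p in people}

joined = []
for interview in interviews:
    speaker = people_by_id.get(interview['speaker.@id'])
    if speaker is None:
        continue   # inner join: drop interviews with no matching person
    joined.append({
        '@id_artefact': interview['@id'],
        'name_artefact': interview['name'],
        'name_speaker': speaker['name'],
    })

print(len(joined))                  # 2
print(joined[0]['name_speaker'])    # Speaker one
```

The `_artefact` and `_speaker` suffixes play the same role as the `suffixes` argument in the merge: they keep apart columns that have the same name on both sides.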


## Statistical Summaries

In the metadata there is a key called "birthDate", whose value is a string that holds only the birth year of the speaker. One of the birthDate values is the string "c 1924" rather than a simple sequence of digits, as shown when the list of birthDates is printed.

### Birth Year

In [22]:
# Print the birth year of all the interviewees
new.birthDate

0       1922
1     c 1924
2       1945
3       1908
4       1914
5       1918
6       1926
7       1916
8       1926
9       1913
10      1920
11      1927
12      1918
13      1921
14      1920
15      1908
16      1937
17      1913
18      1929
19      1924
20      1929
21      1918
22      1929
23      1924
24      1925
25      1930
26      1911
27      1932
28      1929
29      1908
30      1914
31      1922
32      1926
Name: birthDate, dtype: object

The value "c 1924" needs to be normalised into a regular-looking year, i.e. 4 digits in a sequence. This can be done by keeping only the last 4 characters of every birthDate string. For example, if year0 = "1918" and year1 = "c 1924" and we only take the last 4 characters, then year0[-4:] = "1918" and year1[-4:] = "1924".

'birthDate' only has the year listed as a text string (str), therefore we need to convert the birthDate value from str to an integer (int) if we want to do any statistical operations based on birthDate. This conversion can be done by 'type casting', eg, if year is a string that has the value "1924", we simply impose the type int(), such as int(year), so the string year = "1924" => integer, year = 1924, which is then a number (not a string) that can undergo maths operations.

The function int(...) in the following line imposes an integer type conversion. That is, int(x) converts x from whatever type it is into an integer, as long as it makes sense for x to be converted into an integer. For example if x = "abc", then it would be impossible to know what value x would have as an integer. But if a string x = "1924", then the integer would have the value 1924.
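
The two steps described above, slicing and type casting, can be tried on the two example strings. The `try`/`except` part is an added illustration of what happens when a string cannot be converted.

```python
for raw in ['1918', 'c 1924']:
    year = int(raw[-4:])     # keep the last 4 characters, then cast to int
    print(raw, '->', year)

# A string that does not look like a number cannot be converted
try:
    int('abc')
except ValueError as error:
    print('Conversion failed:', error)
```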


In [23]:
# Normalising the birth year and casting them as integers
new['birthDate'] = new['birthDate'].apply(lambda year: int(year[-4:]))
new.birthDate

0     1922
1     1924
2     1945
3     1908
4     1914
5     1918
6     1926
7     1916
8     1926
9     1913
10    1920
11    1927
12    1918
13    1921
14    1920
15    1908
16    1937
17    1913
18    1929
19    1924
20    1929
21    1918
22    1929
23    1924
24    1925
25    1930
26    1911
27    1932
28    1929
29    1908
30    1914
31    1922
32    1926
Name: birthDate, dtype: int64

<div class="alert alert-block alert-info">
<b>Python Library: datetime</b> 

<br>    
    
The library datetime can provide the current date and time and allows us to do calculations over any date and time, such as determining the difference between time zones.
<br>    
</div>

In [24]:
# Import the module
import datetime

# We can calculate the mean (average) age
this_year = datetime.datetime.now().year
# Create a list called 'age': for every birth year y in new.birthDate, subtract y from this_year.
# The square brackets [] around the expression collect all of the results into a list.
age = [this_year - y for y in new.birthDate]

# Print the list of the age of all the speakers if they were all alive today
age

[99,
 97,
 76,
 113,
 107,
 103,
 95,
 105,
 95,
 108,
 101,
 94,
 103,
 100,
 101,
 113,
 84,
 108,
 92,
 97,
 92,
 103,
 92,
 97,
 96,
 91,
 110,
 89,
 92,
 113,
 107,
 99,
 95]

<div class="alert alert-block alert-info">
<b>Python Library: statistics</b> 

<br>    
    
The statistics library provides functions to calculate simple statistics, such as the mean, mode, standard deviation, etc., over numeric data.<br>    
</div>

In [25]:
# Import the module
import statistics

print('== AGE ==')
# Print the mean age, which is the average age of all the speakers
print('MEAN:', statistics.mean(age))
# The mode is the most frequently occurring age. That is, there are more speakers of this age than any other.
print('MODE:', statistics.mode(age))
# The median is the middle value if the age of the participants were listed in order.
print('MEDIAN:', statistics.median(age))
# The standard deviation is a statistical metric that gives us an indication of how dispersed the age range of the speakers is
print('STD DEV:', "{:.1f}".format(statistics.stdev(age)))
print()

print('== BIRTH YEAR ==')
# Print the mean, median and mode of the birth year
print('MEAN:', statistics.mean(new.birthDate))
print('MEDIAN:', statistics.median(new.birthDate))
print('MODE:', statistics.mode(new.birthDate))

== AGE ==
MEAN: 99
MODE: 92
MEDIAN: 99
STD DEV: 8.5

== BIRTH YEAR ==
MEAN: 1922
MEDIAN: 1922
MODE: 1929
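
To make these measures concrete, here is a small worked example on an invented list of five ages, chosen so that the results are easy to check by hand.

```python
import statistics

sample_ages = [90, 92, 92, 95, 101]   # invented values

print(statistics.mean(sample_ages))     # 94  (470 / 5)
print(statistics.median(sample_ages))   # 92  (the middle value when sorted)
print(statistics.mode(sample_ages))     # 92  (appears twice)
print(round(statistics.stdev(sample_ages), 1))   # 4.3
```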


<div class="alert alert-block alert-info">
<b>The Counter Container</b> 

<br>    
    
The Counter container keeps a count of how many times each distinct element has been added to it. Learn more about it here: https://docs.python.org/3/library/collections.html#collections.Counter<br>    
</div>
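
A minimal sketch of a Counter at work, using a short invented list of place names:

```python
from collections import Counter

places = ['Penrith', 'Blacktown', 'Penrith', 'St Marys']
counts = Counter(places)

print(counts['Penrith'])        # 2
print(counts.most_common(1))    # [('Penrith', 2)]
print(counts['Plumpton'])       # 0 (elements never added count as zero)
```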

### Other Metadata Features: Place

There are other metadata columns that require normalising in this dataframe. For example, there is a location 'Penrith' as well as 'Kingswood, Penrith', and there is a location 'St. Marys' as well as 'St Marys' (no '.'), as shown when you print the 'address' column.

In [26]:
new.address

0                 Penrith
1             Pendle Hill
2            Quakers Hill
3                 Penrith
4                Plumpton
5                 Penrith
6               Blacktown
7                 Penrith
8               Blacktown
9              Emu Plains
10              Mt Druitt
11           Quakers Hill
12              Blacktown
13              Blacktown
14             Emu Plains
15              Mt Druitt
16              Blacktown
17              Blacktown
18             Riverstone
19                Penrith
20             Rooty Hill
21                Penrith
22                Penrith
23              Blacktown
24                Penrith
25              Blacktown
26              St. Marys
27               St Marys
28    Kingswood, Penrith 
29              Blacktown
30              Blacktown
31             Emu Plains
32               St Marys
Name: address, dtype: object

In [27]:
# Let's normalise these locations within the dataframe 'new'
# NOTE 'address' is where the story told in the interview takes place
new['address'] = new['address'].apply(lambda place: place.split(',')[-1].replace('.', '').strip())
new.address

0          Penrith
1      Pendle Hill
2     Quakers Hill
3          Penrith
4         Plumpton
5          Penrith
6        Blacktown
7          Penrith
8        Blacktown
9       Emu Plains
10       Mt Druitt
11    Quakers Hill
12       Blacktown
13       Blacktown
14      Emu Plains
15       Mt Druitt
16       Blacktown
17       Blacktown
18      Riverstone
19         Penrith
20      Rooty Hill
21         Penrith
22         Penrith
23       Blacktown
24         Penrith
25       Blacktown
26        St Marys
27        St Marys
28         Penrith
29       Blacktown
30       Blacktown
31      Emu Plains
32        St Marys
Name: address, dtype: object
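
The same normalisation can be examined one value at a time outside the dataframe. In the sketch below, `normalise_place` is a hypothetical name that simply wraps the expression used in the lambda above.

```python
def normalise_place(place):
    # keep the part after the last comma, drop full stops, trim spaces
    return place.split(',')[-1].replace('.', '').strip()

examples = ['Penrith', 'St. Marys', 'St Marys', 'Kingswood, Penrith ']
for raw in examples:
    print(repr(raw), '->', repr(normalise_place(raw)))
```

Note the design choice this makes: keeping only the text after the last comma maps 'Kingswood, Penrith' to 'Penrith', so the more specific locality is dropped in favour of the broader one.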

In [28]:
from collections import Counter

place_of_story = new.address
place_of_birth = new.birthPlace

In [29]:
# How many of the interviews talked about a certain location/city/suburb
count_story_place = dict(Counter(new.address))
pprint.pp(count_story_place, sort_dicts=True)

{'Blacktown': 10,
 'Emu Plains': 3,
 'Mt Druitt': 2,
 'Pendle Hill': 1,
 'Penrith': 9,
 'Plumpton': 1,
 'Quakers Hill': 2,
 'Riverstone': 1,
 'Rooty Hill': 1,
 'St Marys': 3}


In [30]:
# Count place of birth
count_birth_place = dict(Counter(new.birthPlace))
pprint.pp(count_birth_place, sort_dicts=True)

{'Ballarat': 1,
 'Beelbangera': 1,
 'Blacktown': 2,
 'Camden': 1,
 'Croydon': 1,
 'Dubbo': 1,
 'England': 1,
 'Hornsby': 1,
 'Italy': 1,
 'Kempsey': 1,
 'Lismore': 1,
 'Malta': 1,
 'Mudgee': 1,
 'Neutral Bay': 1,
 'Newcastle': 1,
 'Newtown': 1,
 'Not stated': 3,
 'Parramatta': 1,
 'Penrith': 4,
 'Regentville': 1,
 'Riverstone': 2,
 'St Marys': 1,
 'Sydney': 2,
 'Wauchope': 1,
 'Wellington': 1}


### Cross-cutting 2 features found in the metadata



In [31]:
# Let's try with 1 suburb, where the suburb = Quakers Hill
suburb = new.loc[new['address'] == 'Quakers Hill']
# For all stories set in this suburb, print all the storytellers' birth year 
print(suburb.birthDate)

2     1945
11    1927
Name: birthDate, dtype: int64


In [32]:
# List all the places of story (suburbs) found in the dataframe
all_suburbs = list(count_story_place.keys())
all_suburbs

['Penrith',
 'Pendle Hill',
 'Quakers Hill',
 'Plumpton',
 'Blacktown',
 'Emu Plains',
 'Mt Druitt',
 'Riverstone',
 'Rooty Hill',
 'St Marys']

In [33]:
names = list(suburb.name_speaker)
suburb

Unnamed: 0,@id_artefact,@type_artefact,name_artefact,hasFile,dateCreated,interviewer,publisher,description_artefact,speaker.@id,contentLocation.@id,...,@type_speaker,birthDate,identifier,name_speaker,description_speaker,birthPlace,address,isPrimaryTopicOf.@id,authorOf.@id,relatedLink.@id
2,#interview-#430,"[RepositoryObject, TextDialogue]",Interview with Judith Eastwell,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-03-05,Robyn Arrowsmith,University of Western Sydney,"Judith Eastwell was born on 14th September, 19...",#571,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1945,33 _Person,Judith Eastwell,"Judith Eastwell was born on 14th September, 19...",Not stated,Quakers Hill,#343,,
11,#interview-#439,"[RepositoryObject, TextDialogue]",Interview with Thelma Masters,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-02-26,Robyn Arrowsmith,University of Western Sydney,"Thelma Masters was born on 28th June, 1927. Wh...",#580,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1927,31 _Person,Thelma Masters,"Thelma Masters was born on 28th June, 1927. Wh...",Not stated,Quakers Hill,#439,,


In [34]:
names

['Judith Eastwell', 'Thelma Masters']

In [35]:
place1 = suburb.loc[(suburb['address'] == all_suburbs[2])]
place1

Unnamed: 0,@id_artefact,@type_artefact,name_artefact,hasFile,dateCreated,interviewer,publisher,description_artefact,speaker.@id,contentLocation.@id,...,@type_speaker,birthDate,identifier,name_speaker,description_speaker,birthPlace,address,isPrimaryTopicOf.@id,authorOf.@id,relatedLink.@id
2,#interview-#430,"[RepositoryObject, TextDialogue]",Interview with Judith Eastwell,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-03-05,Robyn Arrowsmith,University of Western Sydney,"Judith Eastwell was born on 14th September, 19...",#571,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1945,33 _Person,Judith Eastwell,"Judith Eastwell was born on 14th September, 19...",Not stated,Quakers Hill,#343,,
11,#interview-#439,"[RepositoryObject, TextDialogue]",Interview with Thelma Masters,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-02-26,Robyn Arrowsmith,University of Western Sydney,"Thelma Masters was born on 28th June, 1927. Wh...",#580,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1927,31 _Person,Thelma Masters,"Thelma Masters was born on 28th June, 1927. Wh...",Not stated,Quakers Hill,#439,,


In [36]:
place2 = suburb.loc[(suburb['address'] == all_suburbs[2]) & (suburb['name_speaker'] == names[0])]
place2

Unnamed: 0,@id_artefact,@type_artefact,name_artefact,hasFile,dateCreated,interviewer,publisher,description_artefact,speaker.@id,contentLocation.@id,...,@type_speaker,birthDate,identifier,name_speaker,description_speaker,birthPlace,address,isPrimaryTopicOf.@id,authorOf.@id,relatedLink.@id
2,#interview-#430,"[RepositoryObject, TextDialogue]",Interview with Judith Eastwell,[{'@id': 'https://oni-demo.text-commons.org/ap...,1992-03-05,Robyn Arrowsmith,University of Western Sydney,"Judith Eastwell was born on 14th September, 19...",#571,http://omeka.uws.edu.au/farmstofreeways/api/ge...,...,"[Person, RepositoryObject]",1945,33 _Person,Judith Eastwell,"Judith Eastwell was born on 14th September, 19...",Not stated,Quakers Hill,#343,,


In [37]:
# Traverse through the suburbs and print the data we are interested in
# In addition, let's save this information in a dictionary called 'suburbs'
# so we don't have to bother with dataframes
suburbs = dict()
for s in all_suburbs:
    suburbs[s] = dict()
    places = new.loc[(new['address'] == s)]
    print('## ===', s, '-- total:', len(places))
    # NOTE: index is the internal reference for the row in the dataframe called 'places'
    for index, i in places.iterrows():
        # initialise each person's info
        person = dict()
        # name
        name = i['name_speaker']
        print(name)
        # birthPlace
        birthPlace = i['birthPlace']
        person['birthPlace'] = birthPlace
        print(birthPlace)
        # birthDate
        birthDate = i['birthDate']
        person['birthDate'] = birthDate
        print(birthDate)        
        # dialogue files
        files = [f['@id'] for f in i['hasFile'] if f['@id'].endswith('.csv')]
        person['files'] = files
        print(files)
        print()
        
        suburbs[s].update({name: person})

## === Penrith -- total: 9
Patricia Colless
Croydon
1922
['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/427/original_bad0fd7f9c918df1db8b6a5b39faec48.csv']

Florence Gibbons
Regentville
1908
['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/431/original_48adb341f8c79e41ef81d9099903e2fc.csv']

Betty Hargreaves
Neutral Bay
1918
['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/433/original_4511c81b7d228cc5b6d0e8045e50c53d.csv']

Amy Jackson
Penrith
1916
['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/435/original_c4f4fa3d973e23b257eac32cb4c0707e.csv']

Mary Pike
Penrith
1924
['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/447/original_1b1dcea2755638d3b3206c4c592638d4.csv']

Olive P

In [38]:
pprint.pprint(suburbs)

{'Blacktown': {'Amelia Vincent': {'birthDate': 1914,
                                  'birthPlace': 'Italy',
                                  'files': ['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/458/original_06694205fd1c7a2f7eacea8646a17d0a.csv']},
               'Clare Pfoeffer': {'birthDate': 1913,
                                  'birthPlace': 'Parramatta',
                                  'files': ['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/445/original_12adc492f3dcf6b71010deb05d86dfca.csv']},
               'Edna Vidler': {'birthDate': 1908,
                               'birthPlace': 'Newtown',
                               'files': ['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/457/original_e25684d39da3cb18d2a2a8dedcab0d1b.csv']},
               'Joan Jeffery': {'birthDate': 192

### Sanity Check

Let's print out the information on 1 person to check that our data is looking as we expect it to.

In [39]:
NAME = 'Amelia Vincent'
PLACE = 'Blacktown'

# Print the whole dict structure for Amelia Vincent
suburbs[PLACE][NAME]

{'birthPlace': 'Italy',
 'birthDate': 1914,
 'files': ['https://oni-demo.text-commons.org/api/data/item?id=arcp://name,farms-to-freeways/root/description&file=files/458/original_06694205fd1c7a2f7eacea8646a17d0a.csv']}

In [40]:
# Print the birthPlace
suburbs[PLACE][NAME]['birthPlace']

'Italy'

In [41]:
# 'files' is a list, so we read each file into its own dataframe and collect the dataframes in a list
dataframes = list()
for f in suburbs[PLACE][NAME]['files']:
    df = pd.read_csv(f, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    dataframes.append(df)

# How many files are there in the list?
len(dataframes)

1

In [42]:
# There is only one file in this list. In fact, every 'files' list only ever holds one file
dataframes[0]

| Unnamed: 0 | time | speaker | text | notes |
|---|---|---|---|---|
| 0 | | | | INTERVIEW NO. 30 DATE OF INTERVIEW: 21st Febru... |
| 1 | 0.28 | B | My name is Amelia Vincent (nee Tester). I liv... | |
| 2 | 1.19 | A | First of all I'll just ask you a couple of bac... | |
| 3 | | B | 1922. | |
| 4 | | A | And you moved straight to Blacktown when you c... | |
| ... | ... | ... | ... | ... |
| 433 | | B | Oh, it was much ... it was hard but it was muc... | |
| 434 | | A | OK. Well I think we've just about covered eve... | |
| 435 | | B | Well that's my life story, I tell you. | |
| 436 | | A | Unless you can think of anything else you'd pa... | |


## Counting Bigrams

Let's use textacy and spacy to process the text from each of the files and count the bigrams.

<div class="alert alert-block alert-info">
<b>Text Processing</b> 
<br>    
<ul>
    <li>textacy: to find bigrams</li>
    <li>spacy: to ingest and process the text</li>
</ul>    
    
<br>    
</div>
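
To make the idea of a bigram count concrete before bringing in spaCy and textacy, here is a minimal plain-Python sketch. It is not how textacy works internally: it splits text with a simple regular expression instead of a language model, it counts pairs within each utterance rather than across one joined string, and it does not drop stop words or punctuation the way `textacy.extract.basics.ngrams` does by default. The helper name `count_bigrams` and the sample sentences are invented for illustration.

```python
import re
from collections import Counter


def count_bigrams(utterances, min_freq=1):
    """Count adjacent word pairs within each utterance (hypothetical helper)."""
    counts = Counter()
    for utterance in utterances:
        # Lowercase the text and treat runs of letters/apostrophes as tokens
        tokens = re.findall(r"[a-z']+", utterance.lower())
        # Pair each token with the token that follows it
        counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    # Keep only the bigrams seen at least min_freq times
    return Counter({bigram: n for bigram, n in counts.items() if n >= min_freq})


# Invented sample utterances, not taken from the interviews
sample = [
    'We walked to the high school.',
    'The high school was on Main Street.',
    'Main Street had a fuel stove shop.',
]
bigram_counts = count_bigrams(sample, min_freq=2)
print(bigram_counts)
```

Because this sketch keeps stop words, a pair such as `the high` survives the threshold alongside `high school` and `main street`. Filtering those out is one reason to prefer the textacy call used in the cells below.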

In [43]:
import textacy
import spacy

# Load the language model
nlp = spacy.load("en_core_web_sm")

In [44]:
blacktown = list()
print('## == BLACKTOWN')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    text = list(df.text)
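    # Drop the first empty entry (in the file shown above, the opening row holds only notes).
    # list.remove() deletes a single match and raises ValueError if there is none.
    text.remove('')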
    blacktown.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(blacktown))

## == BLACKTOWN
Marjorie Heath
	CUMULATIVE TOTAL 0
Joan Jeffery
	CUMULATIVE TOTAL 152
Joyce McKelvey
	CUMULATIVE TOTAL 438
Joyce Moon
	CUMULATIVE TOTAL 719
Patricia Parker
	CUMULATIVE TOTAL 827
Clare Pfoeffer
	CUMULATIVE TOTAL 939
Olga Robshaw
	CUMULATIVE TOTAL 1140
Marie Sing
	CUMULATIVE TOTAL 1363
Edna Vidler
	CUMULATIVE TOTAL 1768
Amelia Vincent
	CUMULATIVE TOTAL 2205


In [45]:
penrith = list()
print('## == Penrith')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    text = list(df.text)
    # Drop the first empty entry; list.remove() deletes only one match
    text.remove('')
    penrith.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(penrith))

## == Penrith
Patricia Colless
	CUMULATIVE TOTAL 148
Florence Gibbons
	CUMULATIVE TOTAL 288
Betty Hargreaves
	CUMULATIVE TOTAL 548
Amy Jackson
	CUMULATIVE TOTAL 946
Mary Pike
	CUMULATIVE TOTAL 1208
Olive Price
	CUMULATIVE TOTAL 1477
Diana Reynolds
	CUMULATIVE TOTAL 1649
Doreen Scott
	CUMULATIVE TOTAL 1850
Marjory Turner
	CUMULATIVE TOTAL 2062


In [46]:
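# Join all the Blacktown text into one string and run it through the spaCy pipeline
text_b = nlp(' '.join(blacktown))
# Extract the bigrams (n=2) that occur at least 10 times
ngrams_b = list(textacy.extract.basics.ngrams(text_b, 2, min_freq=10))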

In [47]:
words_b = [w.text.lower() for w in ngrams_b]

In [48]:
from collections import Counter
cb = Counter(words_b)
cb

Counter({'main street': 20,
         'patrick street': 10,
         'high school': 25,
         'shopping centre': 12,
         'daily routine': 11,
         'school holidays': 10,
         'swimming pool': 10,
         'health services': 14,
         'fuel stove': 14,
         'bowling club': 12,
         'oh yes': 31,
         'sunday school': 10,
         'got married': 16,
         'years ago': 39,
         'seven hills': 18,
         'red cross': 11,
         'early days': 16,
         'market gardens': 13,
         'things like': 11,
         'little bit': 11})

In [49]:
text_p = nlp(' '.join(penrith))
ngrams_p = list(textacy.extract.basics.ngrams(text_p, 2, min_freq=10))
words_p = [w.text.lower() for w in ngrams_p]
cp = Counter(words_p)
cp

Counter({'long time': 15,
         'old home': 10,
         'high street': 52,
         'castlereagh street': 11,
         'high school': 35,
         'henry street': 14,
         'got married': 13,
         'st. marys': 13,
         'fuel stove': 18,
         'primary school': 12,
         'things like': 32,
         'come home': 14,
         'emu plains': 16,
         'governor phillip': 13,
         'oh yes': 24,
         'station street': 11,
         'old days': 16,
         'years ago': 24,
         'taken place': 12,
         'penrith area': 10,
         'left school': 10,
         'came home': 13,
         'little bit': 27})

In [50]:
overlapping = [w for w in cp if w in cb]
overlapping

['high school',
 'got married',
 'fuel stove',
 'things like',
 'oh yes',
 'years ago',
 'little bit']

In [51]:
unique_blacktown = [w for w in cb if w not in cp]
unique_blacktown

['main street',
 'patrick street',
 'shopping centre',
 'daily routine',
 'school holidays',
 'swimming pool',
 'health services',
 'bowling club',
 'sunday school',
 'seven hills',
 'red cross',
 'early days',
 'market gardens']

In [52]:
unique_penrith = [w for w in cp if w not in cb]
unique_penrith

['long time',
 'old home',
 'high street',
 'castlereagh street',
 'henry street',
 'st. marys',
 'primary school',
 'come home',
 'emu plains',
 'governor phillip',
 'station street',
 'old days',
 'taken place',
 'penrith area',
 'left school',
 'came home']
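
The three list comprehensions above amount to set operations on the keys of the two counters. The sketch below shows that equivalence on a handful of counts copied from the Blacktown and Penrith outputs above (not the full counters), and also prints the counts side by side for the shared bigrams. The names `cb_sample` and `cp_sample` are illustrative stand-ins for `cb` and `cp`.

```python
from collections import Counter

# A few counts copied from the Blacktown and Penrith outputs above
cb_sample = Counter({'high school': 25, 'fuel stove': 14, 'years ago': 39,
                     'seven hills': 18, 'market gardens': 13})
cp_sample = Counter({'high school': 35, 'fuel stove': 18, 'years ago': 24,
                     'emu plains': 16, 'high street': 52})

# Dictionary key views support set operations directly
shared = cb_sample.keys() & cp_sample.keys()
only_blacktown = cb_sample.keys() - cp_sample.keys()
only_penrith = cp_sample.keys() - cb_sample.keys()

# For the shared bigrams, show both counts side by side
for bigram in sorted(shared):
    print(f'{bigram}: Blacktown {cb_sample[bigram]}, Penrith {cp_sample[bigram]}')

print('Only Blacktown:', sorted(only_blacktown))
print('Only Penrith:', sorted(only_penrith))
```

Sets are unordered, so the results are sorted before printing. The list comprehensions keep the order in which the bigrams were first counted, which may be preferable when that order matters.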

In [53]:
birth_year_blacktown = [suburbs['Blacktown'][name]['birthDate'] for name in suburbs['Blacktown']]
birth_year_blacktown.sort()
birth_year_blacktown

[1908, 1913, 1914, 1918, 1921, 1924, 1926, 1926, 1930, 1937]

In [54]:
birth_year_penrith = [suburbs['Penrith'][name]['birthDate'] for name in suburbs['Penrith']]
birth_year_penrith.sort()
birth_year_penrith

[1908, 1916, 1918, 1918, 1922, 1924, 1925, 1929, 1929]
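
A quick summary of the two lists can help when choosing a year to split on. The sketch below uses the birth years copied from the two outputs above and reports the median and the number of people born before a cutoff. The cutoff of 1920 is taken from the threshold used in the next section; the `summary` dictionary is just an illustrative way of holding the results.

```python
from statistics import median

# Birth years copied from the two outputs above
birth_year_blacktown = [1908, 1913, 1914, 1918, 1921, 1924, 1926, 1926, 1930, 1937]
birth_year_penrith = [1908, 1916, 1918, 1918, 1922, 1924, 1925, 1929, 1929]

CUTOFF = 1920  # same threshold as the pre-1920 filter used in the next section

summary = {}
for place, years in [('Blacktown', birth_year_blacktown),
                     ('Penrith', birth_year_penrith)]:
    summary[place] = {
        'median': median(years),
        # True counts as 1, so this sums the people born before the cutoff
        'before_cutoff': sum(year < CUTOFF for year in years),
        'total': len(years),
    }
    print(place, summary[place])
```

Both suburbs have four people born before 1920, which matches the four names printed for each suburb in the pre-1920 cells.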

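### Counting n-grams by Suburb and Birth Year

Next, we restrict each suburb to the people born before 1920 and repeat the bigram count on that smaller set of interviews.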

In [55]:
blacktown_pre20s = list()
print('## == BLACKTOWN PRE-1920')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
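    # Keep only the people born before 1920
    if person_data['birthDate'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
        text = list(df.text)
        # Drop the first empty entry; list.remove() deletes only one match
        text.remove('')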
        blacktown_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(blacktown_pre20s))

## == BLACKTOWN PRE-1920
Joyce McKelvey
	CUMULATIVE TOTAL 286
Clare Pfoeffer
	CUMULATIVE TOTAL 398
Edna Vidler
	CUMULATIVE TOTAL 803
Amelia Vincent
	CUMULATIVE TOTAL 1240


In [56]:
penrith_pre20s = list()
print('## == PENRITH PRE-1920')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
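    # Keep only the people born before 1920
    if person_data['birthDate'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
        text = list(df.text)
        # Drop the first empty entry; list.remove() deletes only one match
        text.remove('')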
        penrith_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(penrith_pre20s))

## == PENRITH PRE-1920
Florence Gibbons
	CUMULATIVE TOTAL 140
Betty Hargreaves
	CUMULATIVE TOTAL 400
Amy Jackson
	CUMULATIVE TOTAL 798
Olive Price
	CUMULATIVE TOTAL 1067
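
The four loops in this notebook differ only in the suburb name and the optional birth-year filter, so they could be folded into one function. The sketch below is one possible shape, not the notebook's own code. `collect_text` and its `load_text` parameter are hypothetical: `load_text` stands for any callable that takes a file URL and returns a list of strings, which in this notebook could wrap the `pd.read_csv` call used above. A tiny invented data set replaces the API so that the sketch runs on its own.

```python
def collect_text(suburbs, place, load_text, born_before=None):
    """Gather the non-empty text rows for one suburb (hypothetical helper)."""
    collected = []
    for person, person_data in suburbs[place].items():
        # Optionally skip people born in or after the cutoff year
        if born_before is not None and person_data['birthDate'] >= born_before:
            continue
        rows = load_text(person_data['files'][0])
        # Drop every empty row, not just the first one
        collected.extend(row for row in rows if row)
    return collected


# Invented stand-in data so the sketch runs without the API
fake_suburbs = {
    'Exampleville': {
        'Person A': {'birthDate': 1915, 'files': ['a.csv']},
        'Person B': {'birthDate': 1925, 'files': ['b.csv']},
    }
}
fake_files = {'a.csv': ['', 'Hello there.', ''], 'b.csv': ['', 'Good morning.']}

everyone = collect_text(fake_suburbs, 'Exampleville', fake_files.get)
pre_1920 = collect_text(fake_suburbs, 'Exampleville', fake_files.get, born_before=1920)
print(everyone)
print(pre_1920)
```

Note that this version drops every empty row, whereas `text.remove('')` removes only the first. Row totals computed this way may therefore be lower than the cumulative totals printed above.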


In [57]:
text_b20 = nlp(' '.join(blacktown_pre20s))
ngrams_b20 = list(textacy.extract.basics.ngrams(text_b20, 2, min_freq=10))
words_b20 = [w.text.lower() for w in ngrams_b20]
cb20 = Counter(words_b20)
cb20

Counter({'main street': 10,
         'fuel stove': 13,
         'got married': 11,
         'years ago': 34,
         'seven hills': 18,
         'early days': 14,
         'oh yes': 16})

In [58]:
text_p20 = nlp(' '.join(penrith_pre20s))
ngrams_p20 = list(textacy.extract.basics.ngrams(text_p20, 2, min_freq=10))
words_p20 = [w.text.lower() for w in ngrams_p20]
cp20 = Counter(words_p20)
cp20

Counter({'henry street': 10,
         'fuel stove': 12,
         'high school': 16,
         'high street': 16,
         'oh yes': 13,
         'years ago': 12,
         'old days': 10,
         'things like': 16})
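
Raw counts are easier to compare across suburbs once they are scaled by the amount of text behind them. As a rough sketch, the code below divides the `years ago` counts from the two pre-1920 outputs by the cumulative row totals printed for each group. Using text rows as the denominator is an assumption made for illustration: the rows include the interviewer's turns and vary in length, so a token count would be a better basis for a careful comparison.

```python
# Counts and row totals copied from the pre-1920 outputs above
rows = {'Blacktown': 1240, 'Penrith': 1067}
years_ago = {'Blacktown': 34, 'Penrith': 12}

# Occurrences per 1,000 text rows, rounded to one decimal place
rate_per_1000 = {
    place: round(1000 * years_ago[place] / rows[place], 1)
    for place in rows
}
print(rate_per_1000)
```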

<div class="alert alert-block alert-success">
<b>You Can Extend this Notebook</b> 
    
<ul>
<li> You can change this notebook to study different suburbs.</li>
<li> Rather than examining the vocabulary of those born before 1920, you can look at the stories of those who were born later.</li>
<li> Try looking at unigrams or trigrams instead of bigrams.</li>
<li> The minimum frequency of bigrams was 10. You can increase or decrease this threshold.</li>
</ul>    
<br>
</div>