# Loading Farms to Freeways from the API and ro-crate metadata file

The Language Data Commons of Australia (LDaCa) packages each of its data collections as an [ro-crate](https://www.researchobject.org/ro-crate/). Every data collection comes with a metadata file called `ro-crate-metadata.json`, and this file is how we obtain metadata on the research objects in the collection.

The metadata file is in the JSON format, so in this notebook we'll learn how to read a JSON file.
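As a minimal sketch of what reading JSON looks like in Python, the example below parses a small, made-up string shaped like an ro-crate metadata file. The entities and names in it are invented for illustration and are not taken from the real collection.

```python
import json

# A tiny, made-up stand-in for an ro-crate-metadata.json file (not real LDaCa data)
raw = '''
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "./", "@type": ["Dataset"], "name": "Toy collection"},
    {"@id": "#1", "@type": "Person", "name": "Example Person"}
  ]
}
'''

crate = json.loads(raw)            # parse the JSON text into Python objects
print(type(crate))                 # the top level is a dict
print(len(crate["@graph"]))        # "@graph" is a list of entities
print(crate["@graph"][1]["name"])  # each entity is itself a dict
```

JSON objects become Python dicts and JSON arrays become Python lists, so the parsed metadata can be explored with ordinary indexing.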

<div class="alert alert-block alert-success">
<b>Skills</b> 
    
<ul>
<li> json file format (see https://en.wikipedia.org/wiki/JSON)</li>
<li> working with dataframes, via pandas</li>
<li> discovering and exploring metadata</li>
<li> extracting ngrams, via textacy</li>
</ul>    
<br>

<b>Skill level:</b> Intermediate
</div>

This notebook uses the library 'requests', as shown in the [Using APIs: Open Australia](https://github.com/Australian-Text-Analytics-Platform/open-australia-api/blob/main/api.ipynb) notebook. If you haven't already familiarised yourself with that notebook, it might be a good idea to do so first.

In [None]:
# Before we begin, let's make sure that we install all the requirements that we need
import sys
!{sys.executable} -m pip install -r requirements.txt

## Import libraries

Python needs the libraries that will be used by the notebook to be specified before they are used. We do this with the reserved word `import`, as shown below.

In [2]:
import json                       # json library to read json file formats
import pprint                     # Prints in a nice way
import requests                   # Uses the requests library for REST apis
import os                         # Loads operating system libraries

## Variables 

We need to specify the path to the data collection. This `Farms to Freeways` data collection is used with permission from [The University of Western Sydney](https://omeka.uws.edu.au/farmstofreeways/). It was made into an ro-crate by the [LDaCa](https://ardc.edu.au/project/language-data-commons-of-australia-ldaca/) project, and it is the data set used here to demonstrate the skills listed above.

The variables below refer to the `path` where the collection can be found. There are also variables below that refer to ro-crate keywords as specified in LDaCa profiles. For example, all artefacts of importance are called `RepositoryObject`, and when an artefact links to a file, it does so with a `hasFile` keyword in the ro-crate metadata file.

Create a file named `.env` and store your API key in it as `API_KEY`. The key is required for downloading any files. To generate one, go to the [LDaCa website](https://oni-demo.text-commons.org).

Example:
```
API_KEY=12345
```

In [3]:
# Specify location where collection is
LDACA_API = 'https://oni-demo.text-commons.org/api/data'
COLLECTION_ID = 'arcp://name,farms-to-freeways/root/description'
from dotenv import load_dotenv    # loads environment variables
load_dotenv() # load the environment variables located in the .env files
API_TOKEN = os.getenv('API_KEY') # store your environment variable in this jupyter notebook
if not API_TOKEN:
    print("Set a variable named API_KEY in the .env file")

In [96]:
# Get the Farms to Freeways ro-crate metadata by passing the arcpId in as a parameter to the get request
params = dict()
params['id'] = COLLECTION_ID

f2f_response = requests.get(LDACA_API, params=params)
metadata = f2f_response.json()

# Inspect the metadata
metadata

{'@context': 'https://w3id.org/ro/crate/1.1/context',
 '@graph': [{'@type': 'CreativeWork',
   '@id': 'ro-crate-metadata.json',
   'identifier': 'ro-crate-metadata.json',
   'conformsTo': [{'@id': 'https://w3id.org/ro/crate/1.1'},
    {'@id': 'https://github.com/Language-Research-Technology/ro-crate-profile#Collection'}],
   'about': {'@id': './'}},
  {'identifier': ['arcp://name,farms-to-freeways/root/description',
    {'@id': '_:local-id:ATAP:arcp://name,farms-to-freeways/root/description'}],
   'name': 'Farms to Freeways Example Dataset',
   'description': 'This data set was exported from an Omeka Repository as an example of a DataCrate. It contains the Collections and Items from the repository but does NOT have the exhibitions. The DOI resolves to an archive of the data elsewhere',
   'publisher': {'@id': 'http://westernsydney.edu.au'},
   'datePublished': '2015-12-01',
   'contactPoint': {'@id': 'K.Trewin@westernsydney.edu.au'},
   '@type': ['Corpus', 'Dataset', 'RepositoryCollect

### ro-crate Profiles

An ro-crate profile is a set of conventions that tell us what elements an ro-crate minimally contains.

These profiles tell us what to expect to find in the data packages. Learn more about them here: https://www.researchobject.org/ro-crate/profiles.html

In [None]:
# Keywords from LDaCa ro-crate profiles
OBJECT_LINKAGE = 'hasFile'
GRAPH = '@graph'
TYPE = '@type'
ID = '@id'

# TYPE values are lists. 
# We define a PRIMARY_OBJECT as a 'RepositoryObject' because that is where the main data is stored 
PRIMARY_OBJECT = 'RepositoryObject'

### Define Variables to Gather Metadata

As suggested above, there are '@types' that define certain objects within the collection. For example, there is a type called 'Person', and a json object of this type stores information about the person, such as 'birthDate'. In the code block below, we discover all the types stored in this metadata file.

In [None]:
# Find all types and find types that have linked objects
linked_objects = set()
types = list()
primary_object_types = set()

# Traverse through all the objects in the metadata file
for entity in metadata[GRAPH]:
    my_type = entity[TYPE]
    if isinstance(my_type, str):
        my_type = [my_type]
    if my_type not in types:
        types.append(my_type)
        
    # [PRIMARY_OBJECT, X] : primary_object_type = X
    if PRIMARY_OBJECT in my_type:
        primary_object_type = [e for e in my_type if e not in [PRIMARY_OBJECT]][0]
        primary_object_types.add(primary_object_type)

        if OBJECT_LINKAGE in entity:
            for x in entity[OBJECT_LINKAGE]:
                filename = x[ID]
                suffix = filename.split('.')[-1]
                # linked_objects is a set, so duplicate (suffix, type) pairs are ignored
                linked_objects.add((suffix, primary_object_type))
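To see what the traversal above collects, here is a hedged sketch that runs the same idea over a toy graph. The three entities, their file names and the helper `as_list` are invented for illustration; the real metadata has many more entities and keys.

```python
# Toy entities, invented for illustration; real entities carry many more keys
toy_graph = [
    {"@id": "#1", "@type": "Person"},
    {"@id": "#2", "@type": ["RepositoryObject", "TextDialogue"],
     "hasFile": [{"@id": "files/2/interview.csv"}, {"@id": "files/2/interview.pdf"}]},
    {"@id": "#3", "@type": ["RepositoryObject", "Text"]},
]


def as_list(value):
    """Return value unchanged if it is a list, otherwise wrap it in a list."""
    return value if isinstance(value, list) else [value]


toy_primary_types = set()
toy_linked = set()
for entity in toy_graph:
    entity_types = as_list(entity["@type"])
    if "RepositoryObject" not in entity_types:
        continue
    # Every type other than 'RepositoryObject' describes the kind of object
    for t in [t for t in entity_types if t != "RepositoryObject"]:
        toy_primary_types.add(t)
        # Record the file suffix of each linked file together with the object type
        for linked in entity.get("hasFile", []):
            suffix = linked["@id"].rsplit(".", 1)[-1]
            toy_linked.add((suffix, t))

print(sorted(toy_primary_types))
print(sorted(toy_linked))
```

Because both collections are sets, an object type or a (suffix, type) pair that appears many times in the graph is stored only once.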

### Flatten a list of lists using itertools

The variable 'types' above is a list of lists and we will flatten it with itertools

<div class="alert alert-block alert-info">
<b>Python Library: itertools</b> 
    
The itertools library provides tools for iterating over collections without writing list comprehensions or explicit "for loops" in your code. The itertools code used below is equivalent to both (A) and (B).

<br> 
    
   
(A)&emsp; flat_types = [t for sublist in types for t in sublist] 

<br>  

(B) &emsp;flat_types = list()
<br>
&emsp;&emsp;&emsp;for sublist in types: 
<br>
&emsp;&emsp; &emsp;&emsp;   for t in sublist:
<br>
&emsp;&emsp;&emsp; &emsp;&emsp;       flat_types.append(t)
</div>

The line of code below flattens a list of lists:
    [[a, b], [c, d]] ==> [a, b, c, d]

In [None]:
import itertools     # `types` above is a list of lists and we will flatten it with itertools

flat_types = list(itertools.chain.from_iterable(types))
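A small sketch of the same flattening on an invented list of type lists shows what comes out. Duplicates are kept in the flat list, which is why a set is used later to find the unique types.

```python
import itertools

# Invented example input: each inner list is the '@type' of one entity
nested = [["Person"], ["RepositoryObject", "TextDialogue"], ["Person"]]

flat = list(itertools.chain.from_iterable(nested))
print(flat)       # → ['Person', 'RepositoryObject', 'TextDialogue', 'Person']

unique = set(flat)
print(len(unique))  # → 3
```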

## Exploring the Metadata

Anytime you work with data, it's always a good idea to inspect it by printing it out.

In [None]:
# Print the variables
# All the types
pprint.pp(sorted(types))

In [None]:
# All the unique types
pprint.pp(set(flat_types))

In [None]:
# Types of PRIMARY_OBJECTs ie [PRIMARY_OBJECT, X]. What kinds of Xs do we have?
print(primary_object_types)

In [None]:
# All the research artefacts/files and their types
pprint.pp(linked_objects)

## Primary Objects

The primary object types are the ones we may care about, so we will pull them into their own dataframe:

<div class="alert alert-block alert-info">
<b>Python Library: pandas (dataframe)</b> 

<br>    
    
A dataframe is akin to a table -- it is made up of rows and columns. 
    
<br>    
In the block of code below, we are creating a dataframe for each "primary_object_type" ('Person', 'TextDialogue', 'Photographic image', and 'Text')
</div>

In [None]:
import pandas as pd  # this means we will refer to pandas as 'pd' throughout the code

all_data = dict()

# traverse over all of the metadata
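for entity in metadata[GRAPH]:
    # '@type' may be a single string or a list of strings
    entity_types = entity[TYPE] if isinstance(entity[TYPE], list) else [entity[TYPE]]
    for t in entity_types:
        if t in primary_object_types: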
            df = pd.json_normalize(entity)
            if t not in all_data:
                all_data[t] = df
            else:
                all_data[t] = pd.concat([all_data[t], df])

In [None]:
print(all_data.keys())

In [None]:
# Print the first TextDialogue item 
text_dialogue = all_data['TextDialogue']

print(text_dialogue.iloc[0])

In [None]:
# Print the primary object Person with @id #568
person = all_data['Person']

person.loc[person['@id'] == '#568']

In [None]:
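# Join each TextDialogue to its speaker: the dialogue's 'speaker.@id' must match the Person's '@id'.
# Columns that exist in both dataframes get the suffix '_artefact' or '_speaker'.
new = pd.merge(left=all_data['TextDialogue'], right=all_data['Person'], left_on="speaker.@id", right_on="@id",
               suffixes=('_artefact', '_speaker'), how='inner')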

# Print the new dataframe
new
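The sketch below illustrates, on invented records, how `pd.json_normalize` turns a nested key into a dotted column name such as `speaker.@id`, and how the merge above then renames clashing columns with the two suffixes. The record contents are assumptions made for the example, not values from the collection.

```python
import pandas as pd

# Invented records that mimic the shape of the metadata (not real data)
dialogue = {"@id": "#10", "name": "Interview", "speaker": {"@id": "#568"}}
people = [
    {"@id": "#568", "name": "Example Speaker", "birthDate": "1918"},
    {"@id": "#569", "name": "Someone Else", "birthDate": "c 1924"},
]

toy_dialogues = pd.json_normalize(dialogue)  # nested key becomes the column 'speaker.@id'
toy_people = pd.json_normalize(people)

# Inner join: keep only dialogues whose speaker id matches a person id
toy_merged = pd.merge(left=toy_dialogues, right=toy_people,
                      left_on="speaker.@id", right_on="@id",
                      suffixes=("_artefact", "_speaker"), how="inner")

print(list(toy_dialogues.columns))
print(list(toy_merged.columns))
print(toy_merged.loc[0, "name_speaker"])
```

Both inputs have a `name` column, so the merged dataframe holds `name_artefact` (the interview) and `name_speaker` (the person), which is where the `name_speaker` column used later in this notebook comes from.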

## Statistical Summaries

In the metadata there is a key called "birthDate", which is a string that holds only the birth year of the speaker. One of the birthDate values is the string "c 1924" instead of a simple sequence of digits, as shown when the list of birthDates is printed.

### Birth Year

In [None]:
# Print the birth year of all the interviewees
new.birthDate

The value "c 1924" needs to be normalised into a regular looking year, ie four digits in a row. This can be done by keeping only the last 4 characters of any string that is a birthDate. For example if year0 = "1918" and year1 = "c 1924" and we only take the last 4 characters, then year0[-4:] = "1918" and year1[-4:] = "1924".

'birthDate' stores the year as a text string (str), so we need to convert it to an integer (int) before doing any statistical operations on it. This conversion is called 'type casting': if year is the string "1924", then int(year) is the integer 1924, a number (not a string) that can undergo maths operations.

The function int(...) in the following line imposes an integer type conversion. That is, int(x) converts x from whatever type it is into an integer, as long as it makes sense for x to be converted into an integer. For example if x = "abc", then it would be impossible to know what value x would have as an integer. But if a string x = "1924", then the integer would have the value 1924.


In [None]:
# Normalising the birth year and casting them as integers
new['birthDate'] = new['birthDate'].apply(lambda year: int(year[-4:]))
new.birthDate
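The slicing and casting used above can be tried on plain strings. The sketch below also shows one possible alternative, a hypothetical helper `parse_year` that searches for the first run of four digits; it is not part of the notebook, and is offered only for values where the year might not sit at the end of the string.

```python
import re


def parse_year(value):
    """Return the first run of four digits in value as an int (hypothetical helper)."""
    match = re.search(r"\d{4}", value)
    if match is None:
        raise ValueError(f"no four-digit year in {value!r}")
    return int(match.group())


for raw_year in ["1918", "c 1924"]:
    last_four = raw_year[-4:]          # keep only the last 4 characters
    print(repr(raw_year), "->", int(last_four), "/", parse_year(raw_year))

# int() fails loudly when the text is not a number
try:
    int("abc")
except ValueError as err:
    print("cannot convert:", err)
```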

<div class="alert alert-block alert-info">
<b>Python Library: datetime</b> 

<br>    
    
The library datetime can provide the current date and time and allows us to do calculations over any date and time, such as determining the difference between time zones.
<br>    
</div>

In [None]:
# Import the module
import datetime

# We can calculate the mean (average) age
this_year = datetime.datetime.now().year
# Create a list called 'age': for every birth year y in new.birthDate, subtract y from this_year.
# The [] around the expression below collects all of the results into a list.
age = [this_year - y for y in new.birthDate]

# Print the list of the age of all the speakers if they were all alive today
age

<div class="alert alert-block alert-info">
<b>Python Library: statistics</b> 

<br>    
    
The statistics library provides functions to calculate simple statistics, such as the mean, mode, standard deviation, etc., over numeric data.<br>    
</div>

In [None]:
# Import the module
import statistics

print('== AGE ==')
# Print the mean age, which is the average age of all the speakers
print('MEAN:', statistics.mean(age))
# The mode is the most frequently occurring age. That is, there are more speakers of this age than any other.
print('MODE:', statistics.mode(age))
# The median is the middle value if the age of the participants were listed in order.
print('MEDIAN:', statistics.median(age))
# The standard deviation is a statistical metric that gives us an indication of how dispersed the age range of the speakers is
print('STD DEV:', "{:.1f}".format(statistics.stdev(age)))
print()

print('== BIRTH YEAR ==')
# Print the mean, median and mode of the birth year
print('MEAN:', statistics.mean(new.birthDate))
print('MEDIAN:', statistics.median(new.birthDate))
print('MODE:', statistics.mode(new.birthDate))
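As a quick check of what these functions return, the sketch below applies them to four invented birth years and a fixed, assumed reference year. It also illustrates why printing the standard deviation once is enough: subtracting every value from the same year shifts the data without changing how spread out it is.

```python
import statistics

toy_years = [1918, 1924, 1924, 1930]      # invented birth years
reference_year = 2024                      # assumed fixed year, for a repeatable example
toy_ages = [reference_year - y for y in toy_years]

print('MEAN:', statistics.mean(toy_years))
print('MEDIAN:', statistics.median(toy_years))
print('MODE:', statistics.mode(toy_years))

# The spread of the ages equals the spread of the birth years
sd_years = statistics.stdev(toy_years)
sd_ages = statistics.stdev(toy_ages)
print('STD DEV (years):', "{:.1f}".format(sd_years))
print('STD DEV (ages):', "{:.1f}".format(sd_ages))
```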

<div class="alert alert-block alert-info">
<b>The Counter Container</b> 

<br>    
    
The Counter container counts how many times each element has been added to it. Learn more about it here: https://docs.python.org/3/library/collections.html#collections.Counter<br>    
</div>
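A minimal sketch of Counter on an invented list of place names:

```python
from collections import Counter

# Invented list of places, for illustration only
toy_places = ["Penrith", "Blacktown", "Penrith", "St Marys", "Penrith"]

counts = Counter(toy_places)
print(counts["Penrith"])        # → 3
print(counts.most_common(1))    # → [('Penrith', 3)]
print(counts["Parramatta"])     # → 0
```

A Counter behaves like a dict whose values are counts, and asking for an element it has never seen returns 0 rather than raising an error.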

### Other Metadata Features: Place

There are other metadata columns that require normalising in this dataframe. For example there is a location 'Penrith' as well as 'Kingston, Penrith', and there is a location 'St. Marys' as well as 'St Marys' (no '.'), as shown when you print the 'address' column.

In [None]:
new.address

In [None]:
# Let's normalise these locations within the dataframe 'new'
# NOTE 'address' is where the story told in the interview takes place
new['address'] = new['address'].apply(lambda place: place.split(',')[-1].replace('.', '').strip())
new.address
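To make the normalisation above easier to follow, here is the same expression wrapped in a function (the name `normalise_place` is made up for this sketch) and applied to the four spellings mentioned earlier.

```python
def normalise_place(place):
    """Keep the text after the last comma, drop full stops, trim spaces."""
    return place.split(',')[-1].replace('.', '').strip()


for raw_place in ["Penrith", "Kingston, Penrith", "St. Marys", "St Marys"]:
    print(repr(raw_place), "->", repr(normalise_place(raw_place)))
```

Keeping only the part after the last comma assumes the broader place name always comes last, which holds for 'Kingston, Penrith' but would need checking for other address formats.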

In [None]:
from collections import Counter

place_of_story = new.address
place_of_birth = new.birthPlace

In [None]:
# How many of the interviews talked about a certain location/city/suburb
count_story_place = dict(Counter(new.address))
pprint.pp(count_story_place, sort_dicts=True)

In [None]:
# Count place of birth
count_birth_place = dict(Counter(new.birthPlace))
pprint.pp(count_birth_place, sort_dicts=True)

### Cross-cutting 2 features found in the metadata



In [None]:
# Let's try with 1 suburb, where the suburb = Quakers Hill
suburb = new.loc[new['address'] == 'Quakers Hill']
# For all stories set in this suburb, print all the storytellers' birth year 
print(suburb.birthDate)

In [None]:
# List every place of story (suburb) found in the metadata
all_suburbs = list(count_story_place.keys())
all_suburbs

In [None]:
names = list(suburb.name_speaker)
suburb

In [None]:
names

In [None]:
place1 = suburb.loc[(suburb['address'] == all_suburbs[2])]
place1

In [None]:
place2 = suburb.loc[(suburb['address'] == all_suburbs[2]) & (suburb['name_speaker'] == names[0])]
place2

In [None]:
# Traverse through the suburbs and print the data we are interested in
# In addition, let's save this information in a dictionary called 'suburbs'
# so we don't have to bother with dataframes
suburbs = dict()
for s in all_suburbs:
    suburbs[s] = dict()
    places = new.loc[(new['address'] == s)]
    print('## ===', s, '-- total:', len(places))
    # NOTE: index is the internal reference for the row in the dataframe called 'places'
    for index, i in places.iterrows():
        # initialise each person's info
        person = dict()
        # name
        name = i['name_speaker']
        print(name)
        # birthPlace
        birthPlace = i['birthPlace']
        person['birthPlace'] = birthPlace
        print(birthPlace)
        # birthDate
        birthDate = i['birthDate']
        person['birthDate'] = birthDate
        print(birthDate)        
        # dialogue files
        files = [f['@id'] for f in i['hasFile'] if f['@id'].endswith('.csv')]
        person['files'] = files
        print(files)
        print()
        
        suburbs[s].update({name: person})

In [None]:
pprint.pprint(suburbs)

### Sanity Check

Let's print out the information on 1 person to check that our data is looking as we expect it to.

In [None]:
NAME = 'Amelia Vincent'
PLACE = 'Blacktown'

# Print the whole dict structure for Amelia Vincent
suburbs[PLACE][NAME]

In [None]:
# Print the birthPlace
suburbs[PLACE][NAME]['birthPlace']

In [None]:
# files are a list, so let's create a list of dataframes to save all the contents of each file
dataframes = list()  # we have a list of files so let's save them as a list of dataframes
for f in suburbs[PLACE][NAME]['files']:
    df = pd.read_csv(f, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    dataframes.append(df)

# How many files are there in the list?
len(dataframes)

In [None]:
# There is only 1 file in the list. Actually there is only ever 1 file in every files list
dataframes[0]

## Counting BiGrams

Let's use textacy and spacy to process the text from each of the files and count the bigrams.

<div class="alert alert-block alert-info">
<b>Text Processing</b> 
<br>    
<ul>
    <li>textacy: to find bigrams</li>
    <li>spacy: to ingest and process the text</li>
</ul>    
    
<br>    
</div>
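Before using the libraries, it may help to see what counting bigrams means. The sketch below is a plain-Python illustration on an invented sentence; it is not how textacy works internally. It splits on spaces only, whereas textacy operates on spaCy tokens and applies its own token filtering by default, so its results will differ.

```python
from collections import Counter

# Invented sentence, for illustration only
sentence = "we moved to the farm and we moved to the town"
tokens = sentence.lower().split()

# A bigram is a pair of neighbouring tokens
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(" ".join(pair) for pair in bigrams)

# Keep only bigrams seen at least MIN_FREQ times (an assumed threshold for this toy text)
MIN_FREQ = 2
frequent = {b: n for b, n in bigram_counts.items() if n >= MIN_FREQ}
print(frequent)
```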

In [None]:
import textacy
import spacy

# Load the language model
nlp = spacy.load("en_core_web_sm")

In [None]:
blacktown = list()
print('## == BLACKTOWN')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
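    # Keep every non-empty utterance (list.remove('') would drop only the first and fail if there were none)
    text = [t for t in df.text if t != '']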
    blacktown.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(blacktown))

In [None]:
penrith = list()
print('## == Penrith')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
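    # Keep every non-empty utterance (list.remove('') would drop only the first and fail if there were none)
    text = [t for t in df.text if t != '']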
    penrith.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(penrith))

In [None]:
text_b = nlp(' '.join(blacktown))
ngrams_b = list(textacy.extract.basics.ngrams(text_b, 2, min_freq=10))

In [None]:
words_b = [w.text.lower() for w in ngrams_b]

In [None]:
from collections import Counter
cb = Counter(words_b)
cb

In [None]:
text_p = nlp(' '.join(penrith))
ngrams_p = list(textacy.extract.basics.ngrams(text_p, 2, min_freq=10))
words_p = [w.text.lower() for w in ngrams_p]
cp = Counter(words_p)
cp

In [None]:
overlapping = [w for w in cp if w in cb]
overlapping

In [None]:
unique_blacktown = [w for w in cb if w not in cp]
unique_blacktown

In [None]:
unique_penrith = [w for w in cp if w not in cb]
unique_penrith
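The three comparisons above can also be written with set operations on the Counter keys. The sketch below uses two invented Counters with made-up bigrams and counts.

```python
from collections import Counter

# Invented bigram counts standing in for the two suburbs
toy_cb = Counter({"the farm": 12, "main street": 10})
toy_cp = Counter({"the farm": 15, "the river": 11})

shared = set(toy_cb) & set(toy_cp)   # bigrams found in both
only_b = set(toy_cb) - set(toy_cp)   # bigrams found only in the first
only_p = set(toy_cp) - set(toy_cb)   # bigrams found only in the second

print(shared, only_b, only_p)
```

Sets do not preserve order, so the list comprehensions used in the cells above are the better choice when the original ordering matters.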

In [None]:
birth_year_blacktown = [suburbs['Blacktown'][name]['birthDate'] for name in suburbs['Blacktown']]
birth_year_blacktown.sort()
birth_year_blacktown

In [None]:
birth_year_penrith = [suburbs['Penrith'][name]['birthDate'] for name in suburbs['Penrith']]
birth_year_penrith.sort()
birth_year_penrith

### Counting n-grams for Suburb given Birth Year

In [None]:
blacktown_pre20s = list()
print('## == BLACKTOWN PRE-1920')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
    if person_data['birthDate'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
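        # Keep every non-empty utterance (list.remove('') would drop only the first and fail if there were none)
        text = [t for t in df.text if t != '']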
        blacktown_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(blacktown_pre20s))

In [None]:
penrith_pre20s = list()
print('## == PENRITH PRE-1920')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
    if person_data['birthDate'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
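        # Keep every non-empty utterance (list.remove('') would drop only the first and fail if there were none)
        text = [t for t in df.text if t != '']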
        penrith_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(penrith_pre20s))

In [None]:
text_b20 = nlp(' '.join(blacktown_pre20s))
ngrams_b20 = list(textacy.extract.basics.ngrams(text_b20, 2, min_freq=10))
words_b20 = [w.text.lower() for w in ngrams_b20]
cb20 = Counter(words_b20)
cb20

In [None]:
text_p20 = nlp(' '.join(penrith_pre20s))
ngrams_p20 = list(textacy.extract.basics.ngrams(text_p20, 2, min_freq=10))
words_p20 = [w.text.lower() for w in ngrams_p20]
cp20 = Counter(words_p20)
cp20

<div class="alert alert-block alert-success">
<b>You Can Extend this Notebook</b> 
    
<ul>
<li> You can change this notebook to study different suburbs.</li>
<li> Rather than examining the vocabulary of those born before 1920, you can look at the stories of those who were born later.</li>
<li> Try looking at unigrams or trigrams instead of bigrams.</li>
<li> The minimum frequency of bigrams was 10. You can increase or decrease this threshold.</li>
</ul>    
<br>
</div>