# Loading Farms to Freeways from the API and ro-crate metadata file

The Language Data Commons of Australia (LDaCA) packages each of its data collections as an [ro-crate](https://www.researchobject.org/ro-crate/). Every data collection comes with a metadata file called `ro-crate-metadata.json`, and this file is how we obtain metadata about the research objects in the collection.

The metadata file is in JSON format, so this notebook shows how to read a JSON file.
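As a minimal sketch of what reading JSON looks like in Python, the block below parses a tiny, made-up stand-in for `ro-crate-metadata.json` with the standard `json` module. The entity names and ids are invented for illustration; only the overall shape (a `@context` plus a flat `@graph` list of entities) follows the ro-crate convention.

```python
import json

# A tiny, made-up stand-in for ro-crate-metadata.json (not the real file).
raw = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "./", "@type": "Dataset", "name": "Example collection"},
    {"@id": "#person1", "@type": "Person", "name": "Example Speaker"}
  ]
}
"""

# json.loads parses JSON text into ordinary Python dicts and lists.
# For a file on disk, json.load(file_handle) does the same job.
crate = json.loads(raw)

# The entities of an ro-crate are listed in the "@graph" array.
for entity in crate["@graph"]:
    print(entity["@id"], entity["@type"], entity["name"])
```

Once parsed, the metadata is plain Python data, so the usual dictionary and list operations apply.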

<div class="alert alert-block alert-success">
<b>Skills</b> 
    
<ul>
<li> json file format (see https://en.wikipedia.org/wiki/JSON)</li>
<li> working with dataframes, via pandas</li>
<li> discovering and exploring metadata</li>
<li> extracting ngrams, via textacy</li>
</ul>    
<br>

<b>Skill level:</b> Intermediate
</div>

This notebook uses the library 'requests', as shown in the [Using APIs: Open Australia](https://github.com/Australian-Text-Analytics-Platform/open-australia-api/blob/main/api.ipynb) notebook. If you haven't already familiarised yourself with that notebook, it might be a good idea to do so first.

In [1]:
# Before we begin, let's make sure that we install all the requirements that we need
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting en_core_web_sm
  Using cached en_core_web_sm-3.0.0-py3-none-any.whl
Collecting ldaca@ git+https://github.com/Language-Research-Technology/ldaca-py.git@v0.0.4
  Cloning https://github.com/Language-Research-Technology/ldaca-py.git (to revision v0.0.4) to /private/var/folders/zs/rntfhht92jz6nzrm5qxywfgm0000gn/T/pip-install-y4qgj8zs/ldaca_2d3c3e6452fe48a990a20420d534235c
  Running command git clone -q https://github.com/Language-Research-Technology/ldaca-py.git /private/var/folders/zs/rntfhht92jz6nzrm5qxywfgm0000gn/T/pip-install-y4qgj8zs/ldaca_2d3c3e6452fe48a990a20420d534235c
  Running command git checkout -q cc5c1c890e64faa09909649b0da3d57552761de1
Collecting matplotlib==3.4.3
  Using cached matplotlib-3.4.3.tar.gz (37.9 MB)
Collecting requests==2.26
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting pandas==1.3.4
  Using cached pandas-1.3.4.tar.gz (4.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... d

## Import libraries

Python needs the libraries used by the notebook to be imported before they are used. We do this with the `import` keyword, as shown below.

In [44]:
import json                       # Reads and writes the JSON file format
import pprint                     # Prints nested data in a readable way
import requests                   # HTTP library used for REST APIs
import os                         # Access to the operating system, e.g. environment variables
from ldaca.ldaca import LDaCA     # Loads the LDaCA REST API wrapper
from rocrate_lang.utils import as_list # A handy utility for converting to list

## Variables 

We need to specify the path to the data collection. The `Farms to Freeways` data collection is used with permission from [The University of Western Sydney](https://omeka.uws.edu.au/farmstofreeways/). It was made into an ro-crate by the [LDaCA](https://ardc.edu.au/project/language-data-commons-of-australia-ldaca/) project, and it is the data set used here to demonstrate the skills listed above.

The variables below refer to the `path` where the collection can be found. Other variables refer to ro-crate conventions specified in the LDaCA profiles. For example, every artefact of importance is typed `RepositoryObject`, and an artefact links to its files through the `hasPart` property in the ro-crate metadata file.
<div class="alert alert-block alert-success">
Create a file named vars.env and store your API key in it as API_KEY. The key is required for downloading any files. To generate an API key, go to the <a href='https://data-dev.ldaca.edu.au/'>LDaCA website</a>.

Example vars.env:

API_KEY=12345

</div>

In [None]:
# Specify the location of the collection
from dotenv import load_dotenv    # Loads environment variables from a file

LDACA_API = 'https://data-dev.ldaca.edu.au/api'
COLLECTION_ID = 'arcp://name,farms-to-freeways/corpus/root'

load_dotenv('vars.env')           # Read the variables defined in the vars.env file
API_TOKEN = os.getenv('API_KEY')  # Fetch the API key from the environment
if not API_TOKEN:
    print("Set a variable named API_KEY in the vars.env file")

In [46]:
# Get the Farms to Freeways ro-crate metadata by passing the arcpId in as a parameter to the get request

ldaca = LDaCA(url=LDACA_API, token=API_TOKEN, data_dir='data')
ldaca.retrieve_collection(collection=COLLECTION_ID, collection_type='Collection', data_dir='data')

metadata = ldaca.crate

# Inspect the metadata object
metadata

<rocrate_lang.rocrate_plus.ROCratePlus at 0x10ff8ff40>

### ro-crate Profiles

An ro-crate profile is a set of conventions that tell us what elements an ro-crate minimally contains.

These profiles tell us what to expect to find in the data packages. Learn more about them here: https://www.researchobject.org/ro-crate/profiles.html

In [47]:
# TYPE values should be lists. 
# We define a PRIMARY_OBJECT as a 'RepositoryObject' because that is where the main data is stored 
PRIMARY_OBJECT = 'RepositoryObject'

### Define Variables to Gather Metadata

As suggested above, '@type' values define the kinds of object within the collection; for example, there is a type called 'Person'. A JSON object of this type stores information about the person, such as 'birthDate'. In the code block below, we discover all the types stored in this metadata file.
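The type-collecting step can be sketched on plain dictionaries. In JSON-LD an `@type` may be a single string or a list of strings, which is why each value is normalised to a list first. The `to_list` helper below is a hypothetical stand-in that is only assumed to behave roughly like `as_list`; the entities are made up.

```python
def to_list(value):
    """Hypothetical stand-in for as_list: wrap a single value in a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Made-up entities: '@type' is sometimes a string, sometimes a list.
entities = [
    {"@id": "#a", "@type": "Person"},
    {"@id": "#b", "@type": ["RepositoryObject", "Text"]},
    {"@id": "#c", "@type": ["RepositoryObject", "Person"]},
]

types = []
for entity in entities:
    for entity_type in to_list(entity.get("@type")):
        types.append(entity_type)

# dict.fromkeys removes duplicates while keeping first-seen order.
unique_types = list(dict.fromkeys(types))
print(unique_types)  # ['Person', 'RepositoryObject', 'Text']
```

Normalising first means the inner loop never iterates over the characters of a bare string by accident.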

In [223]:
# Find all types and find types that have linked objects
files = set()
types = list()
primary_object_types = list()

# Let's see what we can find in our metadata
for entity in ldaca.crate.contextual_entities + ldaca.crate.data_entities:
    entity_type = as_list(entity.type)  # We make sure that each type is a list
    for e_t in entity_type:
        types.append(e_t)


## Exploring the Metadata

Anytime you work with data, it's always a good idea to inspect it by printing it out.

In [49]:
# Print the variables
# All the types, removing duplicates
list(dict.fromkeys(types))

['OrganizationReuseLicense',
 'ContactPoint',
 'Person',
 'Organization',
 'RepositoryObject',
 'Text',
 'Photographic image',
 'GeoCoordinates',
 'Place',
 'Interview Transcript',
 'Sound',
 'Dataset',
 'RepositoryCollection',
 'PropertyValue',
 'csvw:Schema',
 'csvw:Column',
 'Language',
 'DefinedTerm',
 'SoftwareSourceCode',
 'CreateAction',
 'File',
 'Annotation',
 'PrimaryText']

## Primary Objects

The primary objects are the ones we care about most, so we will pull them into their own dataframe.

We use the function `ldaca.crate.dereference(id)` to look up the entity that a given id refers to.
More on how to consume an ro-crate using Python here: https://github.com/ResearchObject/ro-crate-py#consuming-an-ro-crate
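Conceptually, dereferencing means following an `{"@id": ...}` reference to the full entity it points at. The sketch below illustrates the idea with a plain dictionary index over a made-up graph. The `dereference` function here is a hypothetical illustration of the concept, not the library's implementation.

```python
# Made-up graph: the interview refers to its speaker only by '@id'.
graph = [
    {"@id": "#interview1", "@type": "RepositoryObject",
     "speaker": {"@id": "#person1"}},
    {"@id": "#person1", "@type": "Person", "name": "Example Speaker"},
]

# Index every entity by its '@id' so lookups are a single dict access.
index = {entity["@id"]: entity for entity in graph}

def dereference(entity_id):
    """Hypothetical helper: return the entity with this '@id', or None."""
    return index.get(entity_id)

interview = dereference("#interview1")
speaker = dereference(interview["speaker"]["@id"])  # follow the reference
print(speaker["name"])  # Example Speaker
```

The same pattern, following an `@id` to a full entity, is what the notebook relies on for both files and speakers.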

In [50]:
# Types of PRIMARY_OBJECTs, i.e. [PRIMARY_OBJECT, X]. What kinds of Xs do we have?
for entity in ldaca.crate.contextual_entities + ldaca.crate.data_entities:
    if PRIMARY_OBJECT in as_list(entity.type):
        print(entity.get('name'))
        item = ldaca.crate.dereference(entity.id)
        primary_object_types.append(item.as_jsonld())

Western Sydney Women's Oral History Project: Thank-you note from Olive Price
Western Sydney Women's Oral History Project: Flier (illustrated)
Western Sydney Women's Oral History Project: Flier
Western Sydney Women's Oral History Project: Interview question outline
Western Sydney Women's Oral History Project: Sources used for finding informants
Photo of Pat Colless 1
Photo of Pat Colless 2
Photo of Heather Corr 1
Photo of Heather Corr 2
Photo of Judith Eastwell 1
Photo of Judith Eastwell 2
Photo of Florence Gibbons 1
Photo of Florence Gibbons 2
Photo of Iris Hanna 1
Photo of Iris Hanna 2
Photo of Iris Hanna's home
Photo of Betty Hargreaves 1
Photo of Betty Hargreaves 2
Photo of Betty Hargreaves 3
Photo of Marg Heath 1
Photo of Marg Heath 2
Photo of Amy Jackson 2
Photo of Amy Jackson 1
Photo of Mavis Lamrock 1
Photo of Edith Mason
Photo of Joyce Moon
Photo of Brenda Niccol 1
Photo of Brenda Niccol 2
Photo of Pat Parker 1
Photo of Pat Parker 2
Photo of Claire Pfoeffer 1
Photo of Claire Pf

<div class="alert alert-block alert-info">
<b>Python Library: pandas (dataframe)</b> 
<br>    
A dataframe is akin to a table -- it is made up of rows and columns. 
<br>    
In the block of code below, we create a single dataframe with one row for each primary object, that is, each entity typed 'RepositoryObject' (for example 'Person', 'Photographic image', and 'Text' objects).
</div>
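To see what `pd.json_normalize` does to nested JSON, here is a small sketch on made-up records. Nested keys are flattened into dotted column names (which is where columns such as `speaker.@id` come from), and keys missing from a record become `NaN`.

```python
import pandas as pd

# Made-up records: the first has a nested 'speaker', the second an 'author'.
records = [
    {"@id": "#interview1", "name": "Interview 1",
     "speaker": {"@id": "#person1"}},
    {"@id": "#note1", "name": "Thank-you note", "author": "Example Author"},
]

df = pd.json_normalize(records)

# The nested {"speaker": {"@id": ...}} becomes a 'speaker.@id' column.
print(sorted(df.columns))
print(df)
```

Lists are not flattened this way; they stay as Python lists inside a single cell, as the `@type` column in the notebook output shows.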

In [11]:
import pandas as pd  # this means we will refer to pandas as 'pd' throughout the code

primary_objects_dataframe = pd.json_normalize(primary_object_types)
primary_objects_dataframe

Unnamed: 0,@id,@type,description,text,author,publisher,originalFormat,identifier,name,copyrightHolder,...,@reverse.transcriptOf,isPrimaryTopicOf.@id,@reverse.speaker,authorOf.@id,encodingFormat,speaker.@id,conformsTo.@id,language.@id,linguisticGenre.@id,indexableText.@id
0,"arcp://name,farms-to-freeways/collection/weste...","[RepositoryObject, Text]","A short thank-you note, returned to the interv...",Dear Miss Arrowsmith\r\nThank you for your let...,Olive Price,University of Western Sydney,Note card,ftf_thankyou_price,Western Sydney Women's Oral History Project: T...,Western Sydney University,...,,,,,,,,,,
1,"arcp://name,farms-to-freeways/collection/weste...","[RepositoryObject, Text]",Flier (illustrated) seeking participants for t...,"<div style=""text-align:center;"">DID YOU LIVE I...",,University of Western Sydney,Paper,FTF_flier_illust,Western Sydney Women's Oral History Project: F...,Western Sydney University,...,,,,,,,,,,
2,"arcp://name,farms-to-freeways/collection/weste...","[RepositoryObject, Text]",Flier seeking participants for the project.,"<div style=""text-align:center;"">WESTERN SYDNEY...",,University of Western Sydney,Paper,FTF_flier,Western Sydney Women's Oral History Project: F...,Western Sydney University,...,,,,,,,,,,
3,"arcp://name,farms-to-freeways/collection/weste...","[RepositoryObject, Text]",Outline of questions for conducting oral histo...,,,University of Western Sydney,Paper,FTF_questions,Western Sydney Women's Oral History Project: I...,Western Sydney University,...,,,,,,,,,,
4,"arcp://name,farms-to-freeways/collection/weste...","[RepositoryObject, Text]",This document lists organisations and other so...,,,University of Western Sydney,Paper,FTF_sources,Western Sydney Women's Oral History Project: S...,Western Sydney University,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,"arcp://name,farms-to-freeways/interview-item/a...",RepositoryObject,"Marjory Turner was born on the 25th January, 1...",,,University of Western Sydney,,,Interview with Marjory Turner,,...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/marjo...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
196,"arcp://name,farms-to-freeways/interview-item/a...",RepositoryObject,"Edna Vidler was born on 31st December, 1908, a...",,,University of Western Sydney,,,Interview with Edna Vidler,,...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/ednav...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
197,"arcp://name,farms-to-freeways/interview-item/a...",RepositoryObject,"Amelia Vincent was born in 1914, in Italy. Ame...",,,University of Western Sydney,,,Interview with Amelia Vincent,,...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/audre...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
198,"arcp://name,farms-to-freeways/interview-item/a...",RepositoryObject,"Audrey Watson was born on 23rd August, 1922 at...",,,University of Western Sydney,,,Interview with Audrey Watson,,...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/audre...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...


## hasPart

Each RepositoryObject has a list of files in an array called `hasPart`.

In [59]:
# Types of File that are in each primary object. What kinds of files do we have?
for entity in primary_object_types:
    if 'hasPart' in entity:
        hasPart = entity.get('hasPart')
        for part in as_list(hasPart):
            file = ldaca.crate.dereference(part.get('@id'))
            files.add(file)
print(f"{len(files)} files")

460 files


## Files

We have extracted the files from the primary objects we care about. Now let's pick out the CSVs by selecting each file of type `Annotation` and then checking that its `encodingFormat` is `text/csv`.
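The two-step filter can be sketched on plain dictionaries. The file entries and the `to_list` helper are made up for illustration. One detail worth noting: testing `'Annotation' in value` on a bare string would do substring matching, so the type is normalised to a list before the membership test.

```python
def to_list(value):
    """Hypothetical stand-in for as_list: wrap a single value in a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Made-up file entities with different types and encoding formats.
files = [
    {"@id": "files/1.csv", "@type": ["File", "Annotation"],
     "encodingFormat": "text/csv"},
    {"@id": "files/1.mp3", "@type": "File",
     "encodingFormat": "audio/mpeg"},
    {"@id": "files/1.pdf", "@type": ["File", "Annotation"],
     "encodingFormat": "application/pdf"},
]

# Keep only entities that are Annotations AND encoded as CSV.
csv_annotations = [
    f for f in files
    if "Annotation" in to_list(f.get("@type"))
    and f.get("encodingFormat") == "text/csv"
]

print([f["@id"] for f in csv_annotations])  # ['files/1.csv']
```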

In [112]:

annotations = set()
csvs = list()

# Pick out the annotation files
for file in files:
    if 'Annotation' in as_list(file.type):
        annotations.add(file)

# From these annotations, select only the CSVs
for annotation in annotations:
    if annotation.get('encodingFormat') == 'text/csv':
        annotation_json = annotation.as_jsonld()
        csvs.append(annotation_json)

print(f"We have {len(csvs)} csv objects")


We have 34 csv objects


## CSVs

Let's explore the metadata of one CSV object.

In [113]:
csv = next(iter(csvs))
print(json.dumps(csv, indent=2, sort_keys=False))

{
  "@id": "https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/429/original_301212cc7bd4fa7dd92c08f24f210069.csv",
  "@type": [
    "File",
    "Annotation"
  ],
  "name": "Transcript of interview with Heather Corr  full text transcription (CSV)",
  "encodingFormat": "text/csv",
  "annotationType": [
    {
      "@id": "txc:Transcription"
    },
    {
      "@id": "txc:TimeAligned"
    }
  ],
  "modality": {
    "@id": "txc:Orthography"
  },
  "annotationOf": {
    "@id": "https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/515/original_4b3126b4b7f8eea706b84f536781a01c.mp3"
  },
  "language": {
    "@id": "https://www.ethnologue.com/language/eng"
  },
  "size": 72240,
  "@reverse": {
    "hasPart": [
      {
        "@id": "arcp://name,farms-to-freeways/corpus/root/"
      },
      {
        "@id": "arcp://name,farms-to-freeways/interview-item/arcp://name,farms-to-freeways/collection/transcriptof

## Normalize CSV objects

Let's use pandas `json_normalize` to create a dataframe for the CSV objects.

In [118]:
csvs_dataframe = pd.json_normalize(csvs)
    
list(csvs_dataframe)

['@id',
 '@type',
 'name',
 'encodingFormat',
 'annotationType',
 'size',
 'modality.@id',
 'annotationOf.@id',
 'language.@id',
 '@reverse.hasPart',
 '@reverse.hasAnnotation',
 '@reverse.indexableText']

In [120]:
csvs_dataframe.iloc[0]['@id']

'https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/429/original_301212cc7bd4fa7dd92c08f24f210069.csv'

## Speakers

A RepositoryObject can have speakers. Let's find them in a similar way to how we found the files.

In [56]:
speakers = list()

for entity in primary_object_types:
    if 'speaker' in entity:
        speaker = entity.get('speaker')
        if speaker:
            for person in as_list(speaker):
                speaker_item = ldaca.crate.dereference(person['@id'])
                speakers.append(speaker_item.as_jsonld())
print(f"{len(speakers)} speakers")

34 speakers


## Person
Each speaker is represented by a `Person` object. Let's explore one.

In [55]:
person = next(iter(speakers))
print(json.dumps(person, indent=2, sort_keys=False))

{
  "@id": "arcp://name,farms-to-freeways/collection/patriciaparker",
  "@type": [
    "RepositoryObject",
    "Person"
  ],
  "birthDate": "1937",
  "identifier": "2 _Person",
  "isPrimaryTopicOf": {
    "@id": "arcp://name,farms-to-freeways/collection/photoofpatparker2"
  },
  "name": "Patricia Parker",
  "description": "Patricia Parker was born on 19th August, 1937, at the Paddington Women's Hospital in Sydney. Patricia moved to Blacktown with her husband and three children in the mid-1950s at age 21. Both Patricia and her husband joined the Communist Party, and Patricia was a member of the Union of Australian Women for many years. An interest in Community Arts led her to do a degree in Arts Administration, and she was the first Community Arts Officer to be appointed to Blacktown Council. During the interview, Pat mentions Blacktown Council's involvement in the Section 94 court case which resulted in developers having to contribute towards community facilities and infrastructure.",


## Normalize Speakers

Let's use pandas `json_normalize` to create a dataframe for the speakers.

In [54]:
speakers_dataframe = pd.json_normalize(speakers)
    
speakers_dataframe   

Unnamed: 0,@id,@type,birthDate,identifier,name,description,birthPlace,address,type,isPrimaryTopicOf.@id,@reverse.hasMember,@reverse.speaker,relatedLink.@id,authorOf.@id
0,"arcp://name,farms-to-freeways/collection/patri...","[RepositoryObject, Person]",1937,2 _Person,Patricia Parker,"Patricia Parker was born on 19th August, 1937,...",Sydney,Blacktown,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
1,"arcp://name,farms-to-freeways/collection/ritac...","[RepositoryObject, Person]",c 1924,11 _Person,Rita Camilleri,Rita Camilleri was born in Malta and arrived i...,Malta,Pendle Hill,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
2,"arcp://name,farms-to-freeways/collection/heath...","[RepositoryObject, Person]",1923,18 _Person,Heather Corr,"Heather Corr was born on 10th April, 1923, at ...",Penrith,Penrith,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
3,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
4,"arcp://name,farms-to-freeways/collection/flore...","[RepositoryObject, Person]",1908,14 _Person,Florence Gibbons,"Florence Gibbons was born on 11th September, 1...",Regentville,Penrith,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
5,"arcp://name,farms-to-freeways/collection/irish...","[RepositoryObject, Person]",1914,24 _Person,Iris Hanna,"Iris Hanna was born on 30th November, 1914, at...",Lismore,Plumpton,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
6,"arcp://name,farms-to-freeways/collection/brend...","[RepositoryObject, Person]",1920,7 _Person,Brenda Niccol,"Brenda Niccol was born on 20th April, 1920, at...",Ballarat,Emu Plains,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
7,"arcp://name,farms-to-freeways/collection/marjo...","[RepositoryObject, Person]",1929,19 _Person,Marjory Turner,"Marjory Turner was born on the 25th January, 1...",Penrith,"Kingswood, Penrith",RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,
8,"arcp://name,farms-to-freeways/collection/audre...","[RepositoryObject, Person]",1922,5 _Person,Audrey Watson,"Audrey Watson was born on 23rd August, 1922 at...",Sydney,Emu Plains,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...","arcp://name,farms-to-freeways/collection/photo...",
9,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...","[{'@id': 'arcp://name,farms-to-freeways/collec...","[{'@id': 'arcp://name,farms-to-freeways/interv...",,


In [224]:
# Let's use 'indexableText.@id' in the primary objects dataframe to join it with the csvs dataframe

primary_objects_dataframe.iloc[195]['indexableText.@id']

'https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/456/original_4666a277569c7da46a184910f24652a3.csv'

In [135]:
df_1 = pd.merge(left=csvs_dataframe, right=primary_objects_dataframe, left_on='@id', right_on="indexableText.@id",
              suffixes=('_csvs', '_po'), how='left')
df_1

Unnamed: 0,@id_csvs,@type_csvs,name_csvs,encodingFormat_csvs,annotationType,size,modality.@id,annotationOf.@id,language.@id_csvs,@reverse.hasPart,...,@reverse.transcriptOf,isPrimaryTopicOf.@id,@reverse.speaker,authorOf.@id,encodingFormat_po,speaker.@id,conformsTo.@id,language.@id_po,linguisticGenre.@id,indexableText.@id
0,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Heather Corr ful...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",72240,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/heath...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
1,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Mary Pike full te...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",59524,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/marjo...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
2,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Marjorie Heath fu...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",105,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/marjo...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
3,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Amy Jackson full ...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",45107,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/audre...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
4,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Audrey Watson ful...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",43887,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/audre...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
5,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Judith Eastwell f...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",58117,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
6,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Doreen Scott full...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",60238,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/doree...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
7,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Sheila Nottley fu...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",41641,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/sheil...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
8,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Marie Sing full t...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",42460,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/marjo...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
9,https://data-dev.ldaca.edu.au/api/stream?id=ar...,"[File, Annotation]",Transcript of interview with Margaret Taylor f...,text/csv,"[{'@id': 'txc:Transcription'}, {'@id': 'txc:Ti...",46317,txc:Orthography,https://data-dev.ldaca.edu.au/api/stream?id=ar...,https://www.ethnologue.com/language/eng,"[{'@id': 'arcp://name,farms-to-freeways/corpus...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/marjo...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...


In [226]:
# Now let's use 'speaker.@id' to join with the dataframe we have just created
df_1.iloc[0]['speaker.@id']

'arcp://name,farms-to-freeways/collection/heathercorr'

In [165]:
new = pd.merge(left=speakers_dataframe, right=df_1, left_on='@id', right_on="speaker.@id",
              suffixes=('_speaker', '_artefact'), how='left')
new

Unnamed: 0,@id,@type,birthDate_speaker,identifier_speaker,name,description_speaker,birthPlace_speaker,address_speaker,type_speaker,isPrimaryTopicOf.@id_speaker,...,@reverse.transcriptOf,isPrimaryTopicOf.@id_artefact,@reverse.speaker_artefact,authorOf.@id_artefact,encodingFormat_po,speaker.@id,conformsTo.@id,language.@id_po,linguisticGenre.@id,indexableText.@id
0,"arcp://name,farms-to-freeways/collection/patri...","[RepositoryObject, Person]",1937,2 _Person,Patricia Parker,"Patricia Parker was born on 19th August, 1937,...",Sydney,Blacktown,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/patri...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
1,"arcp://name,farms-to-freeways/collection/patri...","[RepositoryObject, Person]",1937,2 _Person,Patricia Parker,"Patricia Parker was born on 19th August, 1937,...",Sydney,Blacktown,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/patri...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
2,"arcp://name,farms-to-freeways/collection/ritac...","[RepositoryObject, Person]",c 1924,11 _Person,Rita Camilleri,Rita Camilleri was born in Malta and arrived i...,Malta,Pendle Hill,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/ritac...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
3,"arcp://name,farms-to-freeways/collection/heath...","[RepositoryObject, Person]",1923,18 _Person,Heather Corr,"Heather Corr was born on 10th April, 1923, at ...",Penrith,Penrith,RepositoryObject,"arcp://name,farms-to-freeways/collection/photo...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/heath...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
4,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
106,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
107,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...
108,"arcp://name,farms-to-freeways/collection/joywi...","[RepositoryObject, Person]",1926,17 _Person,Joy Willis,"Joy Willis was born on 5th February, 1926 at S...",St Marys,St Marys,RepositoryObject,"arcp://name,farms-to-freeways/collection/trans...",...,,,,,audio/mpeg,"arcp://name,farms-to-freeways/collection/joywi...",https://purl.archive.org/textcommons/profile#O...,https://www.ethnologue.com/language/eng,txc:Interview,https://data-dev.ldaca.edu.au/api/stream?id=ar...


## Statistical Summaries

In the metadata there is a key called "birthDate", which is a string that holds only the birth year of the speaker. One of the birthDate values is the string "c 1924" rather than a plain sequence of digits, as shown when the list of birthDates is printed.

### Birth Year

In [166]:
new.birthDate_speaker

0        1937
1        1937
2      c 1924
3        1923
4        1926
        ...  
105      1926
106      1926
107      1926
108      1926
109      1926
Name: birthDate_speaker, Length: 110, dtype: object

If 'birthDate' only lists the year as a text string (str), we need to convert the value from str to an integer (int) before we can do any statistical operations on it. This conversion can be done by 'type casting'. For example, if year is a string with the value "1924", then int(year) returns the integer 1924, which is a number (not a string) that can take part in maths operations.

The function int(...) in the following line imposes an integer type conversion. That is, int(x) converts x from whatever type it is into an integer, as long as it makes sense for x to be converted into an integer. For example if x = "abc", then it would be impossible to know what value x would have as an integer. But if a string x = "1924", then the integer would have the value 1924.
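
As a small illustration of the casting described above, here is a minimal sketch. The helper name `to_year` is hypothetical (it is not part of the notebook), and it assumes the year is always the last four characters of the string.

```python
def to_year(value):
    """Return a birth year as an int, whether it arrives as an int or a string."""
    if isinstance(value, int):
        return value
    # Keep only the last four characters, so "c 1924" becomes "1924"
    return int(value[-4:])


plain = to_year("1924")
approximate = to_year("c 1924")
print(plain, approximate)

# int() raises a ValueError when the string cannot be read as a number
try:
    int("abc")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```

Slicing with `[-4:]` is a simple choice that suits this collection. A string with text after the year would need a different approach.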

In [167]:
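# Normalising the birth years and casting them as integers.
# year[-4:] keeps only the last four characters, so "c 1924" becomes "1924" before int() is applied.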

new.birthDate_speaker = new.birthDate_speaker.apply(lambda year: year if (type(year) == int) else int(year[-4:]))
new.birthDate_speaker

0      1937
1      1937
2      1924
3      1923
4      1926
       ... 
105    1926
106    1926
107    1926
108    1926
109    1926
Name: birthDate_speaker, Length: 110, dtype: int64

<div class="alert alert-block alert-info">
<b>Python Library: datetime</b> 

<br>    
    
The library datetime can provide the current date and time and allows us to do calculations over any date and time, such as determining the difference between time zones.
<br>    
</div>

In [168]:
# Import the module
import datetime

# Get the current year so that we can work out each speaker's age
this_year = datetime.datetime.now().year
# Create a list called 'age': for every birth year y in new.birthDate_speaker, subtract y from this_year.
# The [] around the expression (a list comprehension) stores all of the results in a list.
age = [this_year - y for y in new.birthDate_speaker]

# Print the list of the age of all the speakers if they were all alive today
age

[85,
 85,
 98,
 99,
 96,
 96,
 96,
 96,
 96,
 114,
 108,
 102,
 102,
 93,
 93,
 93,
 93,
 93,
 93,
 100,
 100,
 100,
 100,
 96,
 96,
 96,
 96,
 96,
 93,
 93,
 93,
 93,
 93,
 93,
 114,
 114,
 114,
 95,
 96,
 96,
 96,
 96,
 96,
 96,
 96,
 96,
 96,
 96,
 102,
 102,
 114,
 85,
 85,
 109,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 100,
 100,
 100,
 100,
 98,
 98,
 97,
 97,
 98,
 98,
 97,
 97,
 93,
 93,
 93,
 93,
 93,
 93,
 114,
 114,
 114,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 93,
 114,
 114,
 114,
 100,
 100,
 100,
 100,
 100,
 100,
 100,
 100,
 96,
 96,
 96,
 96,
 96]

<div class="alert alert-block alert-info">
<b>Python Library: statistics</b> 

<br>    
    
The statistics library provides functions to calculate simple statistics, such as the mean, mode, standard deviation, etc., over numeric data.<br>    
</div>

In [169]:
# Import the module
import statistics

print('== AGE ==')
# Print the mean age, which is the average age of all the speakers
print('MEAN:', statistics.mean(age))
# The mode is the most frequently occurring age. That is, there are more speakers of this age than any other.
print('MODE:', statistics.mode(age))
# The median is the middle value if the age of the participants were listed in order.
print('MEDIAN:', statistics.median(age))
# The standard deviation is a statistical metric that gives us an indication of how dispersed the age range of the speakers is
print('STD DEV:', "{:.1f}".format(statistics.stdev(age)))
print()

print('== BIRTH YEAR ==')
# Print the mean, median and mode of the birth year
print('MEAN:', statistics.mean(new.birthDate_speaker))
print('MEDIAN:', statistics.median(new.birthDate_speaker))
print('MODE:', statistics.mode(new.birthDate_speaker))

== AGE ==
MEAN: 97.56363636363636
MODE: 93
MEDIAN: 96.0
STD DEV: 6.7

== BIRTH YEAR ==
MEAN: 1924.4363636363637
MEDIAN: 1926.0
MODE: 1929


<div class="alert alert-block alert-info">
<b>The Counter Container</b> 

<br>    
    
The Counter container counts how many times each distinct element has been added to it. Learn more about it here: https://docs.python.org/3/library/collections.html#collections.Counter<br>    
</div>
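
To see how a Counter behaves before we apply it to the dataframe, here is a minimal sketch on a short, made-up list of suburbs (the list `sample_places` is illustrative only, not data from the collection).

```python
from collections import Counter

# A made-up list of places, for illustration only
sample_places = ['Blacktown', 'Penrith', 'Penrith', 'St Marys', 'Penrith']

counts = Counter(sample_places)
print(counts['Penrith'])       # how many times 'Penrith' was seen
print(counts.most_common(1))   # the most frequent element and its count
print(counts['Riverstone'])    # an element that was never added counts as 0
```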

### Other Metadata Features: Place

There are other metadata columns that require normalising in this dataframe. For example, there is a location 'Penrith' as well as 'Kingston, Penrith', and a location 'St. Marys' as well as 'St Marys' (no '.'). These variants are found in the 'address_speaker' column.

In [170]:
new.address_speaker

0        Blacktown
1        Blacktown
2      Pendle Hill
3          Penrith
4         St Marys
          ...     
105       St Marys
106       St Marys
107       St Marys
108       St Marys
109       St Marys
Name: address_speaker, Length: 110, dtype: object

In [171]:
# Let's normalise these locations within the dataframe 'new'
# NOTE 'address' is where the story told in the interview takes place
new.address_speaker = new['address_speaker'].apply(lambda place: place.split(',')[-1].replace('.', '').strip())
new.address_speaker

0        Blacktown
1        Blacktown
2      Pendle Hill
3          Penrith
4         St Marys
          ...     
105       St Marys
106       St Marys
107       St Marys
108       St Marys
109       St Marys
Name: address_speaker, Length: 110, dtype: object
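
To check what the normalisation does, the same expression can be tried on the two variants mentioned earlier. This is a minimal sketch; the function name `normalise_place` is hypothetical and simply wraps the lambda used in the cell above.

```python
def normalise_place(place):
    """Keep the text after the last comma, drop full stops and trim spaces."""
    return place.split(',')[-1].replace('.', '').strip()


print(normalise_place('Kingston, Penrith'))
print(normalise_place('St. Marys'))
print(normalise_place('Blacktown'))
```

Note that keeping only the text after the last comma is a choice that suits this collection. An address written as 'Penrith, NSW' would be reduced to 'NSW'.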

In [172]:
from collections import Counter

place_of_story = new.address_speaker
# NOTE: this column holds the speaker's birth year, not their place of birth
birth_year = new.birthDate_speaker

In [173]:
# How many of the interviews talked about a certain location/city/suburb
count_story_place = dict(Counter(new.address_speaker))
pprint.pp(count_story_place, sort_dicts=True)

{'Blacktown': 18,
 'Emu Plains': 20,
 'Mt Druitt': 1,
 'Pendle Hill': 1,
 'Penrith': 42,
 'Plumpton': 1,
 'Quakers Hill': 1,
 'Riverstone': 1,
 'St Marys': 25}


In [174]:
# Count how many interviews there are for each speaker birth year
count_birth_year = dict(Counter(new.birthDate_speaker))
pprint.pp(count_birth_year, sort_dicts=True)

{1908: 11,
 1913: 1,
 1914: 1,
 1920: 4,
 1922: 16,
 1923: 1,
 1924: 5,
 1925: 4,
 1926: 25,
 1927: 1,
 1929: 37,
 1937: 4}


### Cross-cutting 2 features found in the metadata



In [175]:
# Let's try with 1 suburb, where the suburb = Penrith
suburb = new.loc[new['address_speaker'] == 'Penrith']
# For all stories set in this suburb, print all the storytellers' birth year 
print(suburb.birthDate_speaker)

3     1923
9     1908
13    1929
14    1929
15    1929
16    1929
17    1929
18    1929
28    1929
29    1929
30    1929
31    1929
32    1929
33    1929
55    1929
56    1929
57    1929
58    1929
59    1929
60    1929
67    1925
68    1925
71    1925
72    1925
73    1929
74    1929
75    1929
76    1929
77    1929
78    1929
82    1929
83    1929
84    1929
85    1929
86    1929
87    1929
88    1929
89    1929
90    1929
91    1929
92    1929
93    1929
Name: birthDate_speaker, dtype: int64
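
The filter above looks at one suburb at a time. As an alternative sketch, pandas can cross-cut the two features for every suburb at once with `groupby`. The small dataframe `sample` below is built by hand from a few rows shown in this notebook and only stands in for `new`.

```python
import pandas as pd

# A hand-made stand-in for the dataframe 'new', with only the columns we need
sample = pd.DataFrame({
    'name': ['Heather Corr', 'Joy Willis', 'Joy Willis', 'Patricia Parker'],
    'address_speaker': ['Penrith', 'St Marys', 'St Marys', 'Blacktown'],
    'birthDate_speaker': [1923, 1926, 1926, 1937],
})

# For each suburb, collect the distinct birth years of its storytellers
years_by_suburb = (
    sample.groupby('address_speaker')['birthDate_speaker']
    .apply(lambda years: sorted(set(years)))
    .to_dict()
)
print(years_by_suburb)
```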


In [176]:
# List all the suburbs (places of story) found in the metadata
all_suburbs = list(count_story_place.keys())
all_suburbs

['Blacktown',
 'Pendle Hill',
 'Penrith',
 'St Marys',
 'Plumpton',
 'Emu Plains',
 'Quakers Hill',
 'Mt Druitt',
 'Riverstone']

In [184]:
names = list(suburb.name)
names = list( dict.fromkeys(names) )

In [185]:
names

['Heather Corr ', 'Florence Gibbons', 'Marjory Turner', 'Doreen Scott']

In [200]:
# Traverse through the suburbs and print the data we are interested in
# In addition, let's save this information in a dictionary called 'suburbs'
# so we don't have to bother with dataframes
suburbs = dict()
for s in all_suburbs:
    suburbs[s] = dict()
    places = new.loc[(new['address_speaker'] == s)]
    print('## ===', s, '-- total:', len(places))
    # NOTE: index is the internal reference for the row in the dataframe called 'places'
    for index, i in places.iterrows():
        # initialise each person's info
        person = dict()
        # name
        name = i['name']
        print(name)
        # birthPlace
        birthPlace = i['birthPlace_speaker']
        person['birthPlace_speaker'] = birthPlace
        print(birthPlace)
        # birthDate
        birthDate = i['birthDate_speaker']
        person['birthDate_speaker'] = birthDate
        print(birthDate)        
        # dialogue files
        person_files = []
        print(person)
        if i['indexableText.@id']:
            person_files.append(i['indexableText.@id'])
        person['files'] = person_files
        print(person_files)
        print()
        
        # NOTE: a speaker with several rows overwrites their earlier entry, so only the last file is kept
        suburbs[s].update({name: person})

## === Blacktown -- total: 18
Patricia Parker
Sydney
1937
{'birthPlace_speaker': 'Sydney', 'birthDate_speaker': 1937}
['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/427/original_bad0fd7f9c918df1db8b6a5b39faec48.csv']

Patricia Parker
Sydney
1937
{'birthPlace_speaker': 'Sydney', 'birthDate_speaker': 1937}
['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/444/original_fa45f94ab5670c1ebf4fa24f4075e04f.csv']

Edna Vidler
Newtown
1908
{'birthPlace_speaker': 'Newtown', 'birthDate_speaker': 1908}
['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/454/original_e29fff27016bc12bc659e5af4fdf089b.csv']

Edna Vidler
Newtown
1908
{'birthPlace_speaker': 'Newtown', 'birthDate_speaker': 1908}
['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/438/original_d27f1917186b18b32eea1c0676d747df.csv']

Edna Vidler
Ne

In [201]:
pprint.pprint(suburbs)

{'Blacktown': {'Clare Pfoeffer': {'birthDate_speaker': 1913,
                                  'birthPlace_speaker': 'Parramatta',
                                  'files': ['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/445/original_12adc492f3dcf6b71010deb05d86dfca.csv']},
               'Edna Vidler': {'birthDate_speaker': 1908,
                               'birthPlace_speaker': 'Newtown',
                               'files': ['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/457/original_e25684d39da3cb18d2a2a8dedcab0d1b.csv']},
               'Olga Robshaw': {'birthDate_speaker': 1924,
                                'birthPlace_speaker': 'Riverstone',
                                'files': ['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/449/original_7b7260385251965e3a899397df73a0dc.csv']},
               'Patricia Parker': {'b

### Sanity Check

Let's print out the information on 1 person to check that our data is looking as we expect it to.

In [202]:
NAME = 'Patricia Parker'
PLACE = 'Blacktown'

# Print the whole dict structure for the chosen speaker (NAME)
suburbs[PLACE][NAME]

{'birthPlace_speaker': 'Sydney',
 'birthDate_speaker': 1937,
 'files': ['https://data-dev.ldaca.edu.au/api/stream?id=arcp://name,farms-to-freeways/corpus/root&path=files/444/original_fa45f94ab5670c1ebf4fa24f4075e04f.csv']}

In [203]:
# Print the birthPlace
suburbs[PLACE][NAME]['birthPlace_speaker']

'Sydney'

<div class="alert alert-block alert-info">
    <b>Downloading a file from the REST API</b> 
    <br>    
    We have the reference of each file stored in LDaCA. 
    <br>    
    We use pandas to download each file, attaching the API_TOKEN to every request.
</div>
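
For HTTP(S) URLs, pandas passes the key-value pairs in `storage_options` on as request headers. As a rough sketch of what that amounts to, the request can also be built by hand with the standard library. The function names `build_request` and `read_remote_csv`, the example URL and the token are all hypothetical, and the sketch assumes the API accepts a Bearer token as in the notebook's own calls.

```python
import io
import urllib.request

import pandas as pd


def build_request(url, token):
    """Build a GET request that carries the token as a Bearer header."""
    return urllib.request.Request(url, headers={'Authorization': 'Bearer %s' % token})


def read_remote_csv(url, token):
    """Download a CSV file with the token attached and return it as a dataframe."""
    with urllib.request.urlopen(build_request(url, token)) as response:
        return pd.read_csv(io.BytesIO(response.read())).fillna('')


# Hypothetical URL and token, used only to inspect the request (nothing is downloaded)
request = build_request('https://example.org/transcript.csv', 'my-secret-token')
print(request.get_header('Authorization'))
```

In practice the `storage_options` form used in this notebook is shorter. The sketch is only meant to show where the token ends up.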

In [204]:
# 'files' is a list, so let's create a list of dataframes to hold the contents of each file
dataframes = list()
for f in suburbs[PLACE][NAME]['files']:
    df = pd.read_csv(f, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    dataframes.append(df)

# How many files are there in the list?
len(dataframes)

1

In [205]:
# There is only 1 file in the list. Actually there is only ever 1 file in every files list
dataframes[0]

Unnamed: 0,time,speaker,text,notes
0,,,,"INTERVIEW NO. 2 DATE OF INTERVIEW: 29thAugust,..."
1,0.33,B,"My name is Patricia Hazel Parker, nee Lawrence...",
2,1.05,A,"Pat, can you tell me a little about your famil...",
3,,B,He did a wide variety of things over the years...,
4,1.52,A,And did you have any brothers or sisters?,
...,...,...,...,...
104,19.04,A,It hasn’t progressed since then?,
105,,B,I don't think so. I think it took on a charact...,
106,19.55,B,The other thing I forgot to say was that my hu...,
107,20.38,A,Well thank you very much Pat. It has been very...,


## Counting BiGrams

Let's use textacy and spacy to process the text from each of the files and count the bigrams.

<div class="alert alert-block alert-info">
<b>Text Processing</b> 
<br>    
<ul>
    <li>textacy: to find bigrams</li>
    <li>spacy: to ingest and process the text</li>
</ul>    
    
<br>    
</div>

In [206]:
import textacy
import spacy

# Load the language model
nlp = spacy.load("en_core_web_sm")

In [207]:
blacktown = list()
print('## == BLACKTOWN')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    text = list(df.text)
    text.remove('')
    blacktown.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(blacktown))

## == BLACKTOWN
Patricia Parker
	CUMULATIVE TOTAL 108
Edna Vidler
	CUMULATIVE TOTAL 513
Clare Pfoeffer
	CUMULATIVE TOTAL 625
Olga Robshaw
	CUMULATIVE TOTAL 894
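
One thing to be aware of in the loop above: `list.remove('')` removes only the first empty string, and it raises a ValueError if there is none. The sketch below compares it with a filter that drops every empty string. The helper `non_empty_rows` and the sample rows are hypothetical.

```python
def non_empty_rows(rows):
    """Return every non-empty string in rows, preserving their order."""
    return [row for row in rows if row != '']


# A made-up 'text' column with two empty cells
rows = ['', 'My name is Patricia.', '', 'And did you have any brothers or sisters?']

first_only = list(rows)
first_only.remove('')            # removes only the first ''
print(len(first_only))           # one empty string is still in the list

filtered = non_empty_rows(rows)  # drops every ''
print(len(filtered))
```

If you swap this filter into the loop, the cumulative totals may come out lower than the ones printed above wherever a transcript has more than one empty text cell.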


In [208]:
penrith = list()
print('## == Penrith')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
    file = person_data['files'][0]
    df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
    df.fillna('', inplace=True)
    text = list(df.text)
    text.remove('')
    penrith.extend(text)
    print(person)
    print('\tCUMULATIVE TOTAL', len(penrith))

## == Penrith
Heather Corr 
	CUMULATIVE TOTAL 300
Florence Gibbons
	CUMULATIVE TOTAL 440
Marjory Turner
	CUMULATIVE TOTAL 652
Doreen Scott
	CUMULATIVE TOTAL 824


In [209]:
text_b = nlp(' '.join(blacktown))
ngrams_b = list(textacy.extract.basics.ngrams(text_b, 2, min_freq=10))

In [210]:
words_b = [w.text.lower() for w in ngrams_b]

In [211]:
from collections import Counter
cb = Counter(words_b)
cb

Counter({'early days': 12,
         'years ago': 29,
         'oh yes': 13,
         'fuel stove': 13,
         'seven hills': 15})
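
To make the idea of counting bigrams with a minimum frequency concrete, here is a simplified pure-Python sketch. It is not how textacy works internally: it splits on whitespace only, whereas spaCy tokenises properly and textacy applies extra filters (for example for stop words and punctuation), so the results will differ from the ones above. The function `count_bigrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def count_bigrams(text, min_freq=1):
    """Count pairs of neighbouring words, keeping pairs seen at least min_freq times."""
    tokens = text.lower().split()
    # Pair every word with the word that follows it
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(bigrams)
    return {bigram: n for bigram, n in counts.items() if n >= min_freq}


sample_text = "Years ago we lived here and years ago we moved"
frequent = count_bigrams(sample_text, min_freq=2)
print(frequent)
```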

In [212]:
text_p = nlp(' '.join(penrith))
ngrams_p = list(textacy.extract.basics.ngrams(text_p, 2, min_freq=10))
words_p = [w.text.lower() for w in ngrams_p]
cp = Counter(words_p)
cp

Counter({'henry street': 12,
         'high school': 28,
         'st. marys': 10,
         'little bit': 14})

In [213]:
overlapping = [w for w in cp if w in cb]
overlapping

[]

In [214]:
unique_blacktown = [w for w in cb if w not in cp]
unique_blacktown

['early days', 'years ago', 'oh yes', 'fuel stove', 'seven hills']

In [215]:
unique_penrith = [w for w in cp if w not in cb]
unique_penrith

['henry street', 'high school', 'st. marys', 'little bit']
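
The three comparisons above can also be written with set operations on the Counter keys. This is a minimal sketch that copies the two Counters printed earlier into `cb_sample` and `cp_sample` (hypothetical names) so that it runs on its own.

```python
from collections import Counter

# Copies of the bigram counts printed above for Blacktown and Penrith
cb_sample = Counter({'early days': 12, 'years ago': 29, 'oh yes': 13,
                     'fuel stove': 13, 'seven hills': 15})
cp_sample = Counter({'henry street': 12, 'high school': 28,
                     'st. marys': 10, 'little bit': 14})

shared = set(cb_sample) & set(cp_sample)          # bigrams frequent in both suburbs
only_blacktown = set(cb_sample) - set(cp_sample)  # frequent only in Blacktown
only_penrith = set(cp_sample) - set(cb_sample)    # frequent only in Penrith

print(sorted(shared))
print(sorted(only_blacktown))
print(sorted(only_penrith))
```

Sets do not keep the original order, which is why the results are sorted before printing.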

In [217]:
birth_year_blacktown = [suburbs['Blacktown'][name]['birthDate_speaker'] for name in suburbs['Blacktown']]
birth_year_blacktown.sort()
birth_year_blacktown

[1908, 1913, 1924, 1937]

In [218]:
birth_year_penrith = [suburbs['Penrith'][name]['birthDate_speaker'] for name in suburbs['Penrith']]
birth_year_penrith.sort()
birth_year_penrith

[1908, 1923, 1925, 1929]

### Counting n-grams for Suburb given Birth Year

In [219]:
blacktown_pre20s = list()
print('## == BLACKTOWN PRE-1920')
for person in suburbs['Blacktown']:
    person_data = suburbs['Blacktown'][person]
    if person_data['birthDate_speaker'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
        text = list(df.text)
        text.remove('')
        blacktown_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(blacktown_pre20s))

## == BLACKTOWN PRE-1920
Edna Vidler
	CUMULATIVE TOTAL 405
Clare Pfoeffer
	CUMULATIVE TOTAL 517


In [220]:
penrith_pre20s = list()
print('## == PENRITH PRE-1920')
for person in suburbs['Penrith']:
    person_data = suburbs['Penrith'][person]
    if person_data['birthDate_speaker'] < 1920:
        file = person_data['files'][0]
        df = pd.read_csv(file, storage_options={'Authorization': 'Bearer %s' % API_TOKEN})
        df.fillna('', inplace=True)
        text = list(df.text)
        text.remove('')
        penrith_pre20s.extend(text)
        print(person)
        print('\tCUMULATIVE TOTAL', len(penrith_pre20s))

## == PENRITH PRE-1920
Florence Gibbons
	CUMULATIVE TOTAL 140


In [221]:
text_b20 = nlp(' '.join(blacktown_pre20s))
ngrams_b20 = list(textacy.extract.basics.ngrams(text_b20, 2, min_freq=10))
words_b20 = [w.text.lower() for w in ngrams_b20]
cb20 = Counter(words_b20)
cb20

Counter({'years ago': 28, 'oh yes': 11, 'seven hills': 15, 'early days': 10})

In [222]:
text_p20 = nlp(' '.join(penrith_pre20s))
ngrams_p20 = list(textacy.extract.basics.ngrams(text_p20, 2, min_freq=10))
words_p20 = [w.text.lower() for w in ngrams_p20]
cp20 = Counter(words_p20)
cp20

Counter()

<div class="alert alert-block alert-success">
<b>You Can Extend this Notebook</b> 
    
<ul>
<li> You can change this notebook to study different suburbs.</li>
<li> Rather than examining the vocabulary of those born before 1920, you can look at the stories of those who were born later.</li>
<li> Try looking at unigrams or trigrams instead of bigrams.</li>
<li> The minimum frequency of bigrams was 10. You can increase or decrease this threshold.</li>
</ul>    
<br>
</div>