# Sample Workflow: Alveo

This worksheet pulls data from the Alveo API and performs named entity recognition (NER) on it using spaCy.

Before you begin, please ensure that you have a `secret.json` file in the current working directory (generally your workspace).<br />If you don't have this file, run the ***Set up secrets*** notebook first, then return here.

In [None]:
# Install dependencies first
!curl -s -O -L https://raw.githubusercontent.com/HASSCloud/TinkerStudio-Examples/master/{requirements.txt,utils.py}
!pip install -q -r requirements.txt
!pip install requests_toolbelt

In [None]:
import spacy
import csv
import geocoder
import pandas as pd
import re
import utils
import pyalveo

Alveo requires a login and uses an API key to validate user requests. We read this from the file `secret.json`.  

The data we will work with is represented by an [item list](http://alveo.edu.au/documentation/discovering-and-searching-the-collections/saving-your-search-results-to-an-item-list/) in Alveo: a list of items selected via a query as the starting point for a research project. In this case I've selected three items from the [Braided Channels](https://app.alveo.edu.au/catalog/braidedchannels) collection, which contains transcripts of oral history interviews. Each item list has a URL, and we use that URL to refer to it here.

In [None]:
API_KEY = utils.secret('alveo')
API_URL = "https://app.alveo.edu.au/"
#item_list_url = "https://app.alveo.edu.au/item_lists/1387"
item_list_url = "https://app.alveo.edu.au/item_lists/1172"
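The `utils.secret` helper comes from the downloaded `utils.py`, and its implementation is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch that assumes `secret.json` is a flat JSON object mapping service names to keys. The function name `read_secret` and that file layout are assumptions for illustration, not the actual `utils` API.

```python
import json
from pathlib import Path


def read_secret(name, path="secret.json"):
    """Return the value stored under `name` in a JSON secrets file.

    Hypothetical helper: assumes the file looks like {"alveo": "<api key>"}.
    """
    secrets = json.loads(Path(path).read_text(encoding="utf-8"))
    try:
        return secrets[name]
    except KeyError:
        # Give a clearer message than a bare KeyError on the service name
        raise KeyError(f"No entry named {name!r} in {path}") from None
```

Keeping the key in a separate file means it never has to appear in the notebook itself, which matters if the notebook is shared.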

We create an API client with the pyalveo module and use the client to get the item list details. We then get the _primary text_ for each item, replace each run of non-word characters with a single space, and store the results in a Python list of texts.

In [None]:
client  =  pyalveo.Client(api_key=API_KEY, api_url=API_URL)
itemlist = client.get_item_list(item_list_url)

print("Item list name: ", itemlist.name())

texts = []
for itemurl in itemlist:
    item = client.get_item(itemurl)
    text = item.get_primary_text()
    text = text.decode() # convert from bytes to a string
    text = re.sub(r'\W+', ' ', text) # raw string: '\W' is an invalid escape in a plain string literal
    texts.append(text) 

print("Got", len(texts), "texts")
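To see what the cleaning step in the loop does, the sketch below applies the same two operations (decode, then collapse non-word characters) to a short made-up byte string. The function name `clean_text` and the sample input are invented for illustration and do not come from the Alveo data.

```python
import re


def clean_text(raw):
    """Decode bytes as UTF-8 and replace each run of non-word characters with one space."""
    text = raw.decode()
    return re.sub(r"\W+", " ", text)


# Made-up sample input, not taken from the item list
sample = b"Hello, world!\n"
print(repr(clean_text(sample)))
```

Note that this removes all punctuation, including sentence-final full stops and apostrophes, so the text passed to the NER model no longer has explicit sentence boundaries. Letters, digits and underscores are kept, and because the pattern is applied to a `str`, accented letters count as word characters too.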

## NER Using spaCy

We will use spaCy to extract named entities from the text. We download the appropriate model and initialise an NLP processor.

In [None]:
# download the spacy models we need
model = 'en_core_web_sm'
spacy.cli.download(model)
nlp = spacy.load(model)

We then extract entities from the texts and convert the results to a Pandas data frame. In this example we retain all of the entity types in the result, and for each entity we include a _context_ string showing the words on each side of the entity that was found.

In [None]:
places = []

for text in texts:
    doc = nlp(text)
    for ent in doc.ents:
        # clamp at 0: a negative start would be counted from the end of the doc
        context = doc[max(ent.start - 2, 0):ent.end + 3]
        context = " ".join([w.text for w in context])
        d = {'label': ent.label_, 'text': ent.text, 'context': context}
        places.append(d)

entities = pd.DataFrame(places)
print("Found", entities.shape[0], "entities in the texts")
entities.head()
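The context string is a token window around each entity. The sketch below shows the same windowing on a plain Python list of tokens, so it can be tried without loading a model. The function name `context_window` and the sample sentence are made up for illustration; `start` and `end` play the role of `ent.start` and `ent.end`, where `end` is exclusive. The lower bound is clamped at zero because a negative slice start counts from the end of the sequence, which would give an empty or wrong window for entities near the beginning of a text.

```python
def context_window(tokens, start, end, before=2, after=3):
    """Return the tokens from `before` positions ahead of the entity
    to `after` positions past its (exclusive) end, joined by spaces."""
    lo = max(start - before, 0)   # never go below the first token
    hi = end + after              # slicing past the end is safe in Python
    return " ".join(tokens[lo:hi])


# Made-up sentence; "Mount Isa" occupies tokens 3..4 (start=3, end=5)
tokens = "We drove from Mount Isa to Cloncurry yesterday".split()
print(context_window(tokens, 3, 5))
print(context_window(tokens, 0, 1))  # entity at the very start of the text
```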

We might be particularly interested in the GPE entities (geopolitical entities such as countries, cities and states), which correspond to place names. We can select these as follows:

In [None]:
locations = entities[entities['label'] == 'GPE']
locations.head()

We can then plot the frequency of occurrence of each place name in the texts.

In [None]:
%matplotlib inline
grouped = locations.groupby('text')
counts = grouped.size()
counts.plot.bar()
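The grouping step counts how many times each place name occurs. As a cross-check that does not need Pandas, the same counts can be computed with `collections.Counter` over a list of entity dictionaries shaped like the ones built earlier. This is only a sketch: the function name `place_frequencies` and the sample records are invented for illustration.

```python
from collections import Counter


def place_frequencies(records, label="GPE"):
    """Count occurrences of each entity text among records with the given label."""
    return Counter(r["text"] for r in records if r["label"] == label)


# Made-up records with the same keys as the dictionaries built in the NER loop
sample = [
    {"label": "GPE", "text": "Sydney", "context": ""},
    {"label": "PERSON", "text": "Mary", "context": ""},
    {"label": "GPE", "text": "Brisbane", "context": ""},
    {"label": "GPE", "text": "Sydney", "context": ""},
]

freq = place_frequencies(sample)
# most_common() lists the names from most to least frequent
print(freq.most_common())
```

The bar chart drawn from `groupby('text').size()` is ordered by place name; sorting by count first, as `most_common()` does here, can make the most frequent places easier to spot.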