# Introduction - Python & Open Collections API

__Types of full-text analysis__

* __Text mining__ is a catch-all term for using algorithms to extract specific kinds of data from a text or group of texts. [E.g., a simple task: find every occurrence of "blood" in Macbeth. A more complicated one: sort thousands of books by genre based on the words they use.]

* __Natural language processing__ (NLP) is a slightly more specific term: it names a field of computer science concerned with the relationship between human language and machines. NLP approaches text as data with a focus on how human language actually works. [E.g., find all the place names in a text, tag a text for its parts of speech, or translate a text.]

* __Topic modeling__ is a very specific method of text analysis that sorts the words in a collection of texts into groups that reflect a probabilistic relationship between those words. It lets people see patterns and relationships between texts in a large corpus. [E.g., see how the language and interests of authors of literary criticism evolve over 100 years.]
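
The simplest text mining task mentioned above, counting how often one word occurs, can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook: `count_word` is a hypothetical helper, and the sample sentence is adapted from Macbeth purely as test data.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b anchors the match at word boundaries so "bloody" is not counted
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return len(pattern.findall(text))

# Sample sentence adapted from Macbeth, used only as illustrative input
sample = "It will have blood, they say; blood will have blood."
print(count_word(sample, "blood"))
```

The same function could be applied to the full text of an item once it has been retrieved from the API.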


## URLs

In [14]:
ocUrl = 'https://open.library.ubc.ca/'
ocApiUrl = 'https://oc-index.library.ubc.ca' # API URL

## Setting API Key

You can get your own API key at https://open.library.ubc.ca/research

In [15]:
apiKey = 'ac40e6c2cb345593ed1691e0a8b601bba398e42d85f81f893c5ab709cec63c6c'
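
A key pasted directly into a notebook is visible to anyone the notebook is shared with. One common alternative, sketched here as an assumption rather than something the API requires, is to read the key from an environment variable. The variable name `OC_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os

def load_api_key(env_var="OC_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            "No API key found: set the " + env_var + " environment variable first."
        )
    return key

# Usage in place of the hard-coded assignment:
# apiKey = load_api_key()
```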

## Choosing a collection & item

In [40]:
collection = 'bcbooks'
itemId = '1.0222552'
print('\n Using item: '+ocUrl+'collections/'+collection+'/items/'+itemId)


 Using item: https://open.library.ubc.ca/collections/bcbooks/items/1.0222552


## Getting item

In [41]:
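import json
import requests
# Build the item endpoint URL, passing the API key as a query parameter
itemUrl = ocApiUrl+'/collections/'+collection+'/items/'+itemId+'?apiKey='+apiKey
# Request the item and decode the JSON response body
apiResponse = requests.get(itemUrl).json()
item = apiResponse['data']
# Pretty-print the item record with sorted keys
print(json.dumps(item, sort_keys=True, indent=4, separators=(',', ': ')))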

{
    "AggregatedSourceRepository": [
        {
            "attrs": {
                "classmap": "ore:Aggregation",
                "lang": "en",
                "ns": "http://www.europeana.eu/schemas/edm/dataProvider",
                "property": "edm:dataProvider"
            },
            "explain": "A Europeana Data Model Property; The name or identifier of the organization who contributes data indirectly to an aggregation service (e.g. Europeana)",
            "iri": "http://www.europeana.eu/schemas/edm/dataProvider",
            "label": "Aggregated Source Repository",
            "value": "CONTENTdm"
        }
    ],
    "CatalogueRecord": [
        {
            "attrs": {
                "classmap": "edm:ProvidedCHO",
                "lang": "en",
                "ns": "http://purl.org/dc/terms/isReferencedBy",
                "property": "dcterms:isReferencedBy"
            },
            "explain": "A Dublin Core Terms Property; A related resource that references, cites, 

## Getting the full text

In [53]:
fullText = item['FullText'][0]['value']
print(fullText)

University of California Publications in
ZOOLOGY
VOLUME  XXIV
1922-1926
EDITORS
CHARLES ATWOOD KOFOID
JOSEPH GRINNELL

UNIVERSITY OF CALIFORNIA PRESS
BERKELEY, CALIFORNIACONTENTS
PAGES
y^ A Geographical Study of the Kangaroo Rats of California, by Joseph
Grinnell    |...      3-124
*^2. Birds and Mammals of the Stikine River Region of Northern British
Columbia and Southeastern Alaska, by H. S. Swarth  125-314
y/37 Birds and Mammals of the Skeena River Region of Northern British
Columbia, by Harry S. Swarth  315-394
</4t. Report on a Collection of Birds made by J. R. Pemberton in Patagonia, by Alexander Wetmore  395-474
Index     475-482BIRDS AND MAMMALS OF THE STIKINE
RIVER REGION OF NORTHERN BRITISH
COLUMBIA AND SOUTHEASTERN ALASKA
BY
H. S. SWARTH
(Contribution from the Museum of Vertebrate Zoology of the University of California)
University of California Publications in Zoology
Vol. 24, No. 2, pp. 125-314, plate 8, 34 figures in text
Issued June 17, 1922
BIRDS AND MAMMALS OF THE STIK
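
Each field in the item record is a list of entries that carry a `value` key, so the lookup used for `FullText` generalizes to other fields. The sketch below is a hedged illustration: `field_values` is a hypothetical helper, `sample_item` is a small stand-in for the real API response, and it is an assumption that some items may lack a given field.

```python
def field_values(item, field_name):
    """Return the 'value' of every entry stored under one metadata field."""
    # A missing field yields an empty list instead of raising KeyError
    entries = item.get(field_name) or []
    return [e["value"] for e in entries if isinstance(e, dict) and "value" in e]

# Small stand-in for the API response, mirroring the structure printed earlier
sample_item = {
    "AggregatedSourceRepository": [
        {"label": "Aggregated Source Repository", "value": "CONTENTdm"}
    ],
    "FullText": [{"value": "University of California Publications in\nZOOLOGY"}],
}

print(field_values(sample_item, "AggregatedSourceRepository"))
# Join all full-text entries into one string
full_text = "\n".join(field_values(sample_item, "FullText"))
print(full_text)
# "Title" is absent from the stand-in record, so this prints an empty list
print(field_values(sample_item, "Title"))
```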

## Clean the full text

In [43]:
import re

# Lower case full text
cleanFullText = fullText.lower()
# Replace every run of non-alphanumeric characters (and underscores) with a single space; digits are kept
pattern = re.compile(r'[\W_]+')
cleanFullText = pattern.sub(' ', cleanFullText)

print(cleanFullText)

university of california publications in zoology volume xxiv 1922 1926 editors charles atwood kofoid joseph grinnell university of california press berkeley californiacontents pages y a geographical study of the kangaroo rats of california by joseph grinnell 3 124 2 birds and mammals of the stikine river region of northern british columbia and southeastern alaska by h s swarth 125 314 y 37 birds and mammals of the skeena river region of northern british columbia by harry s swarth 315 394 4t report on a collection of birds made by j r pemberton in patagonia by alexander wetmore 395 474 index 475 482birds and mammals of the stikine river region of northern british columbia and southeastern alaska by h s swarth contribution from the museum of vertebrate zoology of the university of california university of california publications in zoology vol 24 no 2 pp 125 314 plate 8 34 figures in text issued june 17 1922 birds and mammals of the stikine river region of northern british columbia and s
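
Once the text is lowercased and stripped of punctuation, a word-frequency count is a natural first text mining step. The following is a minimal sketch, not the notebook's own method: `top_words` is a hypothetical helper, the stopword list is a tiny illustrative set rather than a standard one, and the sample string is a shortened stand-in for `cleanFullText`.

```python
from collections import Counter

# Tiny illustrative stopword set; a real analysis would use a fuller list
STOPWORDS = {"of", "the", "and", "in", "by", "a", "on"}

def top_words(clean_text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent tokens, skipping stopwords, numbers and single letters."""
    tokens = [
        t for t in clean_text.split()
        if t not in stopwords and not t.isdigit() and len(t) > 1
    ]
    return Counter(tokens).most_common(n)

# Shortened stand-in for the cleaned full text
sample = ("birds and mammals of the stikine river region of northern british columbia "
          "birds and mammals of the skeena river region")
print(top_words(sample, n=4))
```

Calling `top_words(cleanFullText, n=20)` on the real item would give a rough picture of its vocabulary, although OCR errors such as fused words will show up as separate tokens.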

The cleaned text can also be explored in Voyant Tools, a web-based text analysis environment: http://voyant-tools.org/