# Lecture 6 - Unstructured Data and APIs

<center>
<img src="images/unstructured_data.jpeg" width="500">
</center>

## Definition of unstructured data

Data that does not have a clear pre-defined structure.

* Text documents
* Websites
* Videos
* Course documents

Definition is imprecise because "structure" may be implicit or hidden.

## Working with "unstructured" data

To analyze "unstructured" data you must impose some structure on it. Schema-free databases facilitate working with data with no (or ill-defined) structure.

**Schema-free vs. schema** is more clearly defined than **structured vs. unstructured**.

#### Schema-free advantages

* flexible
* quick to set up
* easy to evolve/reconfigure

#### Schema-free disadvantages

* slow(er) to query
* harder to maintain
* some structure must still be defined
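
The last point, that some structure must still be defined, can be illustrated with plain Python dictionaries standing in for schema-free records. This is only a sketch: the records, the field names, and the `normalize` helper are made up for illustration and are not tied to any particular database.

```python
# Schema-free records: each entry may carry a different set of fields.
# All values here are invented for illustration.
records = [
    {"id": 1, "name": "sample A", "mass_g": 1.2},
    {"id": 2, "name": "sample B"},
    {"id": 3, "notes": "label unreadable"},
]

# To analyze the records we still have to decide which fields we expect.
FIELDS = ("id", "name", "mass_g")

def normalize(record, fields=FIELDS):
    """Return a dict with exactly the requested fields, using None for gaps."""
    return {field: record.get(field) for field in fields}

table = [normalize(r) for r in records]
for row in table:
    print(row)
```

Storing the records took no planning, but the choice of `FIELDS` is a small schema that had to be written down before any analysis could start. Fields outside that choice (such as `notes`) are silently dropped.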

## How to work with "unstructured" data?

#### Extracting "structure" from unstructured data
* text processing (natural language processing) - entire field of CS!
* data "scraping" - BeautifulSoup python package
* API's (technically not "unstructured")

Text processing and data scraping are beyond the scope of this class, but there are many tutorials online. API's (Application Programming Interfaces) are a common way of automatically accessing a structured version of "unstructured" data.

#### How to work with "unstructured" data
* JSON files - flexible "partially structured" data format
* schema-free "databases" (MongoDB, ElasticSearch)

We will not cover schema-free databases in lecture, but MongoDB is easy to set up and has a nice Python interface (`pymongo`). 


## JSON Files

JavaScript Object Notation (JSON) files are a common way of adding structure to data so that it is easier to pass between programs and interact with programmatically. Although originally developed for JavaScript, JSON is now one of the most widespread file types and is supported by most programming languages.

JSON files are very intuitive to use with Python because they are basically just dictionaries and lists. 

In [None]:
import json

info = '{"course":"ChBE 4803", "instructors": ["Medford", "Comer"], "size":45}' #<- note single/double quotes!
js_info = json.loads(info) #<- json.loads loads from a string, json.load loads from a file.

js_info.keys()
js_info['course']

#JSON is a great format for persistent storage of Python data structures:

with open('test.json','w') as f:
    json.dump(js_info, f)

In [None]:
! ls

In [None]:
with open('test.json','r') as f:
    new_info = json.load(f)
    
new_info.keys()
new_info_dict = dict(new_info)
new_info_dict
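
One caveat when using JSON for persistent storage is that the round trip does not preserve every Python type. The short sketch below (the `original` dictionary is an arbitrary example) shows two common surprises: tuples come back as lists, and non-string dictionary keys come back as strings.

```python
import json

# An arbitrary dictionary with a tuple value and an integer key
original = {"point": (1, 2), 3: "three", "missing": None, "flag": True}

text = json.dumps(original)   # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored)
print(restored == original)   # the round trip is not exact
```

If exact types matter, they need to be restored by hand after loading, for example by converting lists back to tuples.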

## Example: PubChem database

[PubChem Search](https://pubchem.ncbi.nlm.nih.gov/)

* Extract and work with JSON representation
* Use RESTful API to access data programmatically
* Demonstrate Python "wrapper" for the API

#### Goal 1: Extract SMILES representation, molecular weight, and boiling point from PubChem JSON file.

In [None]:
import json

with open('ammonia.json') as f:
    nh3 = json.load(f)

In [None]:
# Explore JSON structure

Working with JSON data can be challenging if there are many nested structures, headers, etc. It is very useful to use a visualization tool:

* [JSON Viewer](http://jsonviewer.stack.hu/)
* [Code Beautify](https://codebeautify.org/jsonviewer)
* [Chrome Extension](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh?hl=en-US)

From the visualizer we can see how to extract the information we need.
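
When a visualizer is not at hand, the nesting can also be explored from Python. The helper below is a minimal sketch: `list_paths` is a hypothetical function written for this example, and `toy` is a made-up dictionary that only imitates the `Record`/`Section`/`TOCHeading` nesting used in this lecture.

```python
def list_paths(obj, prefix="", max_depth=3):
    """Return the key/index paths found in nested dicts and lists, down to max_depth."""
    paths = []
    if max_depth == 0:
        return paths
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return paths  # plain values (str, int, ...) have nothing below them
    for key, value in items:
        path = "{}[{!r}]".format(prefix, key)
        paths.append(path)
        paths.extend(list_paths(value, path, max_depth - 1))
    return paths

# Made-up data with the same kind of nesting as the PubChem file
toy = {"Record": {"Section": [{"TOCHeading": "Names and Identifiers"},
                              {"TOCHeading": "Chemical and Physical Properties"}]}}

paths = list_paths(toy, prefix="toy", max_depth=4)
for p in paths:
    print(p)
```

Each printed path can be pasted back into Python to reach that element. Keeping `max_depth` small avoids flooding the screen for large files.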

In [None]:
SMILES = nh3['Record']['Section'][3]['Section'][2]['Section'][3]['Information'][0]['StringValue']
MW = nh3['Record']['Section'][4]['Section'][0]['Information'][0]['Table']['Row'][0]['Cell'][1]['NumValue']
BP = nh3['Record']['Section'][4]['Section'][1]['Section'][3]['Information'][2]['StringValue']
BP, C = BP.split('°')
SMILES

## What can go wrong here???

Converting unstructured information to structured is tedious! The goal and challenge is to not just do this once, but do it in a way that works for other inputs. This can be even more challenging:

In [None]:
with open('cisplatino.json') as f:
    cp = json.load(f)

In [None]:
SMILES = cp['Record']['Section'][3]['Section'][2]['Section'][3]['Information'][0]['StringValue']

In [None]:
def section_by_name(sections, name):
    """ Take a list of Sections from PubChem JSON and return the section with a given name"""
    for s in sections:
        if s['TOCHeading'] == name:
            return s
        
section_by_name(cp['Record']['Section'], "Names and Identifiers")
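
Note that `section_by_name` returns `None` when no heading matches, so a missing section only shows up later as a `TypeError` when the result is indexed. One possible variant, sketched here under the assumption that every section is a dictionary with a `TOCHeading` key and an optional nested `Section` list, follows a whole list of headings and fails with a descriptive error. The name `section_by_path` and the `toy` data are made up for this example.

```python
def section_by_path(sections, headings):
    """Follow a list of TOCHeading names through nested 'Section' lists.

    Returns the innermost matching section, or None if headings is empty.
    """
    section = None
    for step, name in enumerate(headings, start=1):
        matches = [s for s in sections if s.get('TOCHeading') == name]
        if not matches:
            raise KeyError("No section named {!r} (step {} of {})".format(
                name, step, len(headings)))
        section = matches[0]                  # take the first match
        sections = section.get('Section', [])  # descend one level
    return section

# Made-up sections that imitate the nesting of the PubChem file
toy = [{'TOCHeading': 'A', 'Section': [{'TOCHeading': 'B', 'Information': []}]}]

print(section_by_path(toy, ['A', 'B']))
```

The error message names the heading that was not found, which makes it easier to see where a different compound deviates from the expected layout.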

We can use this new function to create a more robust way of extracting info from the PubChem JSON:

In [None]:
def get_info(pc_json):
    """ Return the SMILES string, molecular weight, and boiling point from a PubChem JSON file"""
    info = {} #<- we can store the info in this dictionary as we grab it
    ## Get SMILES string:
    namesec = section_by_name(pc_json['Record']['Section'], "Names and Identifiers")
    descsec = section_by_name(namesec['Section'], "Computed Descriptors")
    smilesec = section_by_name(descsec['Section'], "Canonical SMILES")
    SMILES = smilesec['Information'][0]['StringValue'] #<- we are assuming that there is only one entry here.
    info['SMILES'] = SMILES
    
    ## Get molecular weight
    propsec = section_by_name(pc_json['Record']['Section'],'Chemical and Physical Properties')
    compsec = section_by_name(propsec['Section'],'Computed Properties')
    MW = compsec['Information'][0]['Table']['Row'][0]['Cell'][1]['NumValue'] #<- we are assuming the table has a fixed structure
    info['molecular_weight'] = MW
    
    ## Get boiling point
    
    ### boiling point is in the same properties section as molecular weight, so start from there
    expsec = section_by_name(propsec['Section'],"Experimental Properties")
    bpsec = section_by_name(expsec['Section'], "Boiling Point")
    bpstring = bpsec['Information'][2]['StringValue'] #<- two problems!
    
    ## discuss how to handle problems
    
    return info
    
get_info(cp)

Even with semi-structured data (JSON), it can be challenging to robustly and reliably extract structured information for analysis!
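
The boiling point step in `get_info` is left open for discussion. One plausible reading of the difficulties, which is an assumption here rather than something stated in the notes, is that the index `[2]` is hard-coded and that the value is free text that may use different units. The sketch below scans all entries instead of a fixed position and converts to Celsius. The function name, the regular expression, and the `example` entries are all made up for illustration.

```python
import re

# Matches text such as "-33.35 °C" or "212 °F"; any other format is skipped.
_TEMPERATURE = re.compile(r'(-?\d+(?:\.\d+)?)\s*°\s*([CF])')

def first_temperature_C(information):
    """Return the first parsable temperature in degrees Celsius, or None."""
    for entry in information:
        text = entry.get('StringValue', '')
        match = _TEMPERATURE.search(text)
        if match is None:
            continue  # entry has no value, or one we cannot parse
        value, unit = float(match.group(1)), match.group(2)
        if unit == 'F':
            value = (value - 32.0) * 5.0 / 9.0  # Fahrenheit -> Celsius
        return value
    return None

# Made-up entries that imitate the layout of an 'Information' list
example = [
    {'Reference': ['entry without a value']},
    {'StringValue': 'decomposes'},
    {'StringValue': '212 °F at 760 mmHg'},
]
print(first_temperature_C(example))
```

Returning `None` when nothing can be parsed forces the caller to decide what a missing boiling point means, instead of silently storing a wrong number.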

## API's (Application Programming Interfaces)

API's are like GUI's for experts. They are not limited to "unstructured" data, or even data in general. API is a term for any programmatic structure that makes it easier to interact with a more complex underlying code or data structure. However, they are particularly prevalent in data science because they make accessing data much less painful.
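
As a toy illustration of this idea, the class below hides a nested data structure behind two readable attribute names. The class `Course`, its attributes, and the layout of `raw` are hypothetical. Only the course details are borrowed from the JSON example earlier in the lecture.

```python
class Course:
    """A tiny 'API' over a nested dictionary: callers never index the raw data."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def title(self):
        return self._raw['meta']['course']

    @property
    def instructors(self):
        # Filter the people list so callers do not need to know its layout
        return [p['name'] for p in self._raw['people'] if p['role'] == 'instructor']

raw = {
    'meta': {'course': 'ChBE 4803', 'size': 45},
    'people': [
        {'name': 'Medford', 'role': 'instructor'},
        {'name': 'Comer', 'role': 'instructor'},
        {'name': 'A. Student', 'role': 'student'},
    ],
}

course = Course(raw)
print(course.title, course.instructors)
```

If the layout of `raw` changes, only the class needs to be updated, and code that uses `course.title` keeps working.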

## RESTful API's

REST stands for "representational state transfer". It is an architectural style for web services, usually built on HTTP, that enables accessing data directly through a URL. This is a very common and very powerful approach because it allows the data provider to abstract the database back-end from the API. In other words, data providers can provide a uniform interface to data in relational (schema-driven) databases, schema-free databases, file servers, or services in any programming language. All the user needs to know is how to "query" from a URL. If you pay attention to URL's as you browse the web you will see that you use RESTful API's all the time without knowing it!
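
To see what "querying from a URL" involves, a URL can be taken apart and rebuilt with the Python standard library. The address below is a made-up example on the reserved `example.org` domain, not a real service.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# A made-up query URL: the path names a resource, the query string refines it
url = "https://example.org/search?compound=ammonia&format=json"

parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)

query = parse_qs(parts.query)  # each value comes back as a list
print(query)

# Build a new query URL from a dictionary of parameters
new_query = urlencode({"compound": "water", "format": "json"})
new_url = "{}://{}{}?{}".format(parts.scheme, parts.netloc, parts.path, new_query)
print(new_url)
```

Not every RESTful API uses a query string. Some, like the PubChem API used later in this lecture, encode the whole request in the path.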

<center>
<img src="images/RESTful.png" width="500">
</center>

RESTful API's are simple enough that you can use them without specialized libraries. You just need to use the HTTP protocol, which is implemented in the `requests` Python library:

In [None]:
import requests

response = requests.get("http://www.chbe.gatech.edu/")
response.text

RESTful API's are designed to return data in specific structures, and respond to specific queries that are embedded in the URL. A few notes:

* Many API's require a "key" or "token". This is to avoid spammers overloading their servers.
* Most API's also limit the amount of data per request, and the rate of requests.
* It is still necessary to understand the underlying structure of the data you are querying.

You should always start by reading the documentation of an API to learn what you can/can't do.
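
Because of rate limits, it helps to check the status of each response instead of reading `response.text` blindly. The helper below is a sketch, not a recommended client: `fetch_text` is a hypothetical name, the retry policy is an arbitrary choice, and the `get` argument is assumed to behave like `requests.get` (it accepts a `timeout` and returns an object with `status_code`, `text`, and `raise_for_status()`).

```python
import time

def fetch_text(get, url, retries=3, wait_seconds=1.0):
    """Fetch url with a requests-style get function, retrying if the server is busy."""
    for attempt in range(retries):
        response = get(url, timeout=10)
        # 429 = Too Many Requests, 503 = Service Unavailable
        busy = response.status_code in (429, 503)
        if busy and attempt < retries - 1:
            time.sleep(wait_seconds * (attempt + 1))  # wait a little longer each time
            continue
        response.raise_for_status()  # raises an exception for 4xx/5xx responses
        return response.text

# Intended usage (needs the requests package and a network connection):
# import requests
# text = fetch_text(requests.get, "http://www.chbe.gatech.edu/")
```

Passing the `get` function as an argument keeps the helper easy to try out with a stand-in function before pointing it at a real server.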

#### Goal 2: Use PubChem RESTful API to automatically get SMILES representation of a given compound

[PubChem API tutorial documentation](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial$_Toc458584421)

[PubChem API full documentation](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest)

Let's start by seeing if we can get the "search" part to work

In [None]:
r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ammonia/cids/TXT')
r.text #<- this is the CID of the compound    

In [None]:
def get_CID(chemical):
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'.format(chemical))
    return r.text

cid = get_CID('ammonia')
print(cid)
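
The value returned by `get_CID` is still a string. Assuming the TXT output lists one CID per line, a small helper can turn it into integers. `parse_cids` is a hypothetical name, and `sample` is a made-up response rather than real PubChem output.

```python
def parse_cids(text):
    """Convert a TXT response (assumed to hold one CID per line) into a list of integers."""
    # split() with no arguments also discards the trailing newline
    return [int(line) for line in text.split()]

sample = "111\n222\n"  # made-up response text
print(parse_cids(sample))
```

If the response contains anything other than numbers, for example an error message, `int` raises a `ValueError`, which is a useful signal that the request did not return what was expected.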

Now we need to understand the structure of the query to decide how to search. From the documentation:

* prolog: `https://pubchem.ncbi.nlm.nih.gov/rest/pug`

* input: `/compound/name/ammonia`

* operation: `/cids`

* output: `/TXT`

We already have the input working, and since we just want SMILES the output can also be TXT. We just need to modify the operation.
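
The four parts listed above can be assembled by a small helper, so that only the operation has to change between queries. This is a sketch: `pug_url` is a hypothetical function, it only covers lookups by compound name, and percent-encoding the name with `quote` is an assumption about how names with spaces should be handled.

```python
from urllib.parse import quote

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_url(name, operation, output="TXT"):
    """Assemble prolog + input + operation + output for a compound looked up by name."""
    input_part = "/compound/name/" + quote(name, safe="")  # encode spaces etc.
    return PROLOG + input_part + "/" + operation + "/" + output

print(pug_url("ammonia", "cids"))
print(pug_url("acetic acid", "cids"))
```

The first call reproduces the URL used in `get_CID`. Swapping the `operation` argument is then the only change needed to ask a different question about the same compound.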

In [None]:
def get_SMILES(chemical):
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/property/CanonicalSMILES/TXT'.format(chemical))
    return r.text

N = get_SMILES('cisplatino')
print(N)

This is much easier, less memory intensive, and more robust than trying to extract the property from the full output! However, if you really do want to parse the full output, you can do that too:

In [None]:
def get_full(chemical):
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/record/json'.format(chemical))
    return r.text

json_string = get_full('ammonia')
nh3 = json.loads(json_string)
nh3['PC_Compounds'] #<- Note that this JSON is in a very different structure from the original!

## Python API's

RESTful API's are widely used and easy to interact with. However, reading the documentation and converting more complex queries into the proper URL can be tedious and time consuming. Furthermore, not all data sources use RESTful API's.

Python is one of the most common languages for API's, and widely-used data sources (e.g. PubChem) will often have a Python "wrapper" for their RESTful API.

We can use the [PubChemPy](https://pypi.python.org/pypi/PubChemPy/1.0) API to achieve the same goal, but we will need to install it first:

`pip install PubChemPy`

In [None]:
import pubchempy as pcp
#Now we have access to some more intuitive function names and documentation
#help(pcp)
dir(pcp)

Python APIs make code more readable, and are more intuitive to learn:

In [None]:
compounds = pcp.get_compounds('Ammonia','name')
nh3 = compounds[0]
nh3.atoms #<- the full .json output is already parsed into a nice Python data structure
nh3.bonds

We can access the same SMILES string via the Python API:

In [None]:
dir(nh3) #<- the Python API doesn't store the SMILES string by default

In [None]:
p = pcp.get_properties('CanonicalSMILES', 'ammonia', 'name')
print(p) #<- this works, but is it really better than the RESTful version?

## Conclusion

* Unstructured data provides a flexible solution to digital data storage
* Most data generated by others is available only as "unstructured" data
* Unstructured data can be "structured" manually or by using API's

When retrieving data it is a good idea to read about all of the available retrieval strategies (web scraping, direct download, RESTful API's, Python API's) and design a strategy that maximizes efficiency and flexibility.

When storing your own data you should find a balance between "unstructured" and "structured" that makes sense based on your project. Consider setting up a (schema-free) database and/or custom API to create a seamless interface between your data source and your analysis code.

## Further Reading

* [Hitchhiker's Guide to Python JSON tutorial](http://docs.python-guide.org/en/latest/scenarios/json/)
* [RESTful details](https://restfulapi.net/)
* [PubChem RESTful API tutorial](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial)
* [PubChem Python API documentation](https://pypi.python.org/pypi/PubChemPy/1.0)