<h1>Sources of data</h1>
<li><b>local files</b>: csv, xls, txt, pdf</li>
<li><b>database servers</b>
<ul>
<li>Relational databases
<li>NoSQL databases
</ul>
<li><b>On the web</b>
<ul>
<li>html
<li>JSON
<li>XML
</ul>

<h2>Getting data from the Internet</h2>
<li>Data is accessed through APIs or by web scraping
<li>Data usually arrives as JSON, XML or HTML
<li>Python has libraries that help with fetching the data as well as extracting the "useful" parts

<h2>The <i>requests</i> library</h2>
<li>The primary mechanism for sending an API request or accessing a web server

<h3>Step 1: Import the requests library</h3>

In [1]:
import requests

<h3>Step 2: Send an HTTP request, get the response, and save in a variable</h3>

In [9]:
response = requests.get("http://www.epicurious.com/search/Tofu+Chili")

In [3]:
response

<Response [200]>

<h3>Step 3: Check the response status code to see if everything went as planned</h3>
<li>status code 200: the request response cycle was successful
<li>any other status code: it didn't work as expected (e.g., 404 = page not found)

In [10]:
print(response.status_code)

200
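To make the status codes above a little more concrete, here is a minimal sketch that uses only the standard library. The helper name describe_status is hypothetical (it is not part of requests), and treating the whole 200-299 range as "success" is an assumption that goes slightly beyond the "200 or not" rule above.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        # HTTPStatus knows the standard phrase for each registered code
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus end up here
        phrase = "Unknown status"
    # Assumption: any 2xx code counts as a successful request/response cycle
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"

for code in (200, 404, 500):
    print(describe_status(code))
```

With a real response, the same helper could be called as describe_status(response.status_code).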


In [5]:
response.content



<h3>WWW data is usually encoded</h3>
<li>The b' in front of the response content indicates a byte string, i.e., encoded data
<li>The &lt;meta charset="utf-8"&gt; tag in the page indicates that the page is using "utf-8" encoding
<li>utf-8 == a variable-length character encoding for <i>Unicode</i>
<li>Data received from the world wide web is usually in utf-8
<li>Python strings are "plain" (unencoded) sequences of Unicode characters
<li><b>Corollary!</b> We need to decode the utf-8 byte string into a python str

<a href="http://www.diveintopython3.net/strings.html">Click here!</a> if you want to know all about strings and character encoding

In [6]:
print(type(response.content))
print(type(response.content.decode('utf-8')))

<class 'bytes'>
<class 'str'>
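The difference between bytes and str is easier to see with a character outside plain ASCII. The following is a small illustrative sketch; the sample word is made up and is not taken from the web page above.

```python
# A python str holds characters; "caf\u00e9" is the 4-character word "café"
text = "caf\u00e9"

# Encoding turns the characters into bytes; in utf-8 the é needs two bytes
encoded = text.encode("utf-8")
print(type(encoded), encoded)
print(len(text), len(encoded))

# Decoding with the same encoding gives back the original str
decoded = encoded.decode("utf-8")
print(type(decoded), decoded == text)
```

The lengths differ because len counts characters for a str and bytes for a bytes object.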


<h3>Step 4: Get the content of the response</h3>
<li>Decode from utf-8 into a python str if necessary

In [7]:
response.content.decode('utf-8')



<h3>In-class problem</h3>
<li>Get the contents of Wikipedia's main page and look for the string "Did you know" in it
<li>At what location is it on the page?

In [20]:
url = "https://en.wikipedia.org/wiki/main_page"
#The rest of your code should go below this line
import requests
response = requests.get(url)
#Decode the utf-8 bytes into a python str so that we can search it
stringcontent = response.content.decode('utf-8')
#str.find returns the index of the first match, or -1 if there is no match
location = stringcontent.find('Did you know')
if location > -1:
    print('found it')
else:
    print('oops')

found it


<h1>JSON: JavaScript Object Notation</h1>

<li>Standard for "serializing" data objects for storage or transmission 
<li>Human-readable, useful for data interchange
<li>Also useful for representing and storing semistructured data
<li>Stored as plain text (byte strings or utf-8 strings)
<li>Contains data type information

<h2>json</h2>
<li>The python library - json - deals with converting text to and from JSON


<h2>Python and JSON data types</h2>
<table align="left">
<tr><td>JSON</td><td>Python</td></tr>
<tr><td>number</td>	<td>int,float</td></tr>
<tr><td>string</td>	<td>str</td></tr>
<tr><td>null</td>	<td>None</td></tr>
<tr><td>true/false</td>	<td>True/False</td></tr>
<tr><td>object</td>	<td>dict</td></tr>
<tr><td>array</td>	<td>list</td></tr>
</table>
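The mapping in the table can be checked directly. This is a small sketch with a made-up JSON text that contains one value of each JSON type; the key names are illustrative only.

```python
import json

# One value of each JSON type: number (int and float), string, null, true, array, object
json_text = '{"count": 3, "ratio": 0.5, "name": "tofu", "missing": null, "ok": true, "tags": ["a", "b"], "nested": {}}'

parsed = json.loads(json_text)

# Print the python type that each JSON value was converted to
for key, value in parsed.items():
    print(key, type(value).__name__)
```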

<b>json.loads converts a JSON string into a python data object</b>

In [21]:
import json
data_string = '[{"b": [2, 4], "c": 3.0, "a": "A"},34]'
python_data = json.loads(data_string)
print(python_data)
print(type(python_data))

[{'b': [2, 4], 'c': 3.0, 'a': 'A'}, 34]
<class 'list'>


<h3>json.loads recursively decodes a string in JSON format into equivalent python objects</h3>
<li>data_string's outermost element is converted into a python list (in the example!)
<li>the first element of that list is converted into a dictionary
<li>each key of that dictionary is converted into a string
<li>the value for the key "b" is converted into a list of two integer elements

In [22]:
print(type(data_string),type(python_data))
print(type(python_data[0]),python_data[0])
print(type(python_data[0]['b']),python_data[0]['b'])

<class 'str'> <class 'list'>
<class 'dict'> {'b': [2, 4], 'c': 3.0, 'a': 'A'}
<class 'list'> [2, 4]


In [36]:
python_data[0]['b']

[2, 4]

<h3>json.loads will throw an exception if the format is incorrect</h3>

In [26]:
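#Wrong. WHY? The text Hello (without quotes of its own) is not a valid JSON value
json.loads("Hello")
#Correct: a JSON string must itself be wrapped in double quotes
#json.loads('"Hello"')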

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
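When the text comes from somewhere we do not control, the exception can be caught instead of letting it stop the program. A minimal sketch follows; the helper name try_parse is hypothetical, and returning None on failure is just one possible design choice.

```python
import json

def try_parse(text):
    """Return the decoded python object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # err describes what was expected and where decoding failed
        print("Not valid JSON:", err)
        return None

print(try_parse('Hello'))      # a bare word is not valid JSON
print(try_parse('"Hello"'))    # a double-quoted string is valid JSON
```

Note that None is also what json.loads returns for the valid JSON text null, so a caller that needs to tell the two cases apart would want a different failure signal.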

<h3>json.dumps creates a json string from a python data object</h3>

In [37]:
import json
data_string = json.dumps(python_data)
print(type(data_string))
print(data_string)
data_string

<class 'str'>
[{"b": [2, 4], "c": 3.0, "a": "A"}, 34]


'[{"b": [2, 4], "c": 3.0, "a": "A"}, 34]'
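json.dumps and json.loads are inverses for data like this, which the sketch below checks. The indent and sort_keys arguments are optional parameters of json.dumps that only change the layout of the text, not the data.

```python
import json

python_data = [{"b": [2, 4], "c": 3.0, "a": "A"}, 34]

# indent adds line breaks and indentation; sort_keys orders the dictionary keys
pretty = json.dumps(python_data, indent=2, sort_keys=True)
print(pretty)

# Decoding the text again gives back an equal python object
round_tripped = json.loads(pretty)
print(round_tripped == python_data)
```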

<h1>API: Application Programming Interface</h1>
<li>A protocol containing a set of commands or functions that allow one piece of software to talk to another
<li>Data from the web is often obtained through an API
<li>Web APIs usually consist of two parts:
<ul>
<li><b>request</b> a well-formed HTTP request to a server
<li><b>response</b> a response from the server, usually either an html page or a JSON object
</ul>

<h2>The HTTP request</h2>
<li>Contains a url
<li>Contains a set of parameters required by the server to figure out what data to send back
<li>Often, a parameter is a unique <b>access key</b> that the server uses to keep track of who is requesting the data
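As an illustration of "a url plus a set of parameters", the sketch below builds a request URL with the standard library. The endpoint https://api.example.com/geocode and the key value YOUR_API_KEY are placeholders invented for this example, not a real service or key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and placeholder key, for illustration only
base_url = "https://api.example.com/geocode"
params = {"address": "Columbia University, New York, NY", "key": "YOUR_API_KEY"}

# urlencode escapes spaces, commas and other special characters for us
url = base_url + "?" + urlencode(params)
print(url)

# The server can recover the original parameter values from the query string
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```

Building the query string this way avoids hand-editing the address (for example, replacing spaces manually) before sending the request.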

<h1>API example: Google Geocoding API</h1>
<li><a href="https://developers.google.com/maps/documentation/geocoding/start">Documentation</a>
<li>Google has a large number of map and location related APIs
<li>You need an account and an API key to use these APIs
<li>To set up an account and get a key:
<ul>
<li>go to <a href="https://cloud.google.com/">google cloud</a>
<li>click "go to console" or "try gcp for free"
<li>if creating a new account, enter all details 
<li>go to API and services
<li>click "Enable APIs" and search for geocoding api
<li>click on credentials and create an API key
</ul>


<h2>requests library and API requests</h2>

In [38]:
#My api_key
with open("/Users/ya/Desktop/GCPAPIKey.txt",'r') as f:
    api_key = f.read().strip()


FileNotFoundError: [Errno 2] No such file or directory: '/Users/ya/Desktop/GCPAPIKey.txt'

In [41]:
address="Columbia University, New York, NY"
#In a URL query string, spaces are written as '+'
address=address.replace(' ','+')
#api_key=""
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s&key=%s" % (address,api_key)
response = requests.get(url)
print(type(response))


NameError: name 'api_key' is not defined

In [40]:
response.content


<h4>requests can automatically decode and convert a JSON response into a python object</h4>

In [None]:
type(response.json())

<h3>Exception checking!</h3>
<li>Ideally, we should always check if the data grab has been successful
<li>Especially if we are incorporating our results into a "live" analysis

In [42]:
response_data = ''
address="Columbia University, New York, NY"
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s&key=%s" % (address,api_key)
try:
    response = requests.get(url)
    if response.status_code != 200:
        print("HTTP error",response.status_code)
    else:
        try:
            response_data = response.json()
        except ValueError:
            #response.json() raises a ValueError subclass when the body is not valid JSON
            print("Response not in valid JSON format")
except requests.exceptions.RequestException:
    #Covers connection errors, timeouts, invalid URLs, etc.
    print("Something went wrong with requests.get")
print(type(response_data))
print(response_data)

NameError: name 'api_key' is not defined

<b>Try this</b>: Write a function that takes an address and an API key as arguments and returns a (latitude, longitude) tuple

In [None]:
def get_lat_lng(address_string,api_key):

    return (lat,lng)
    
get_lat_lng("Columbia University",api_key)

In [None]:
get_lat_lng("London Business School",api_key)

In [None]:
get_lat_lng("Monash University",api_key)
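One possible way to approach the parsing half of this exercise is sketched below. It assumes the decoded response is a dict with a "status" field and a "results" list whose entries contain geometry -> location -> lat/lng, as described in the Geocoding API documentation. The helper name extract_lat_lng and the sample data (with made-up coordinates) are hypothetical.

```python
def extract_lat_lng(response_data):
    """Return (lat, lng) from a decoded geocoding response, or None if there is no usable result."""
    # Assumption: a successful lookup is reported with status "OK"
    if response_data.get("status") != "OK":
        return None
    results = response_data.get("results", [])
    if not results:
        return None
    # Use the first (best) match returned by the service
    location = results[0]["geometry"]["location"]
    return (location["lat"], location["lng"])

# Hypothetical sample data with made-up coordinates, shaped like the documented response
sample = {"status": "OK", "results": [{"geometry": {"location": {"lat": 1.5, "lng": -2.5}}}]}
print(extract_lat_lng(sample))
print(extract_lat_lng({"status": "ZERO_RESULTS", "results": []}))
```

Inside get_lat_lng, the value of response.json() could be passed to this helper after the status code check.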

<h1>XML</h1>
<li>eXtensible Markup Language
<li>data is stored in a tree
<li>data items are "tagged" with named values
<li>html is (loosely) similar to XML (both are based on SGML)
<li>The python library - lxml - deals with converting an xml string to python objects and vice versa</li>

In [5]:
data_string = """
<Bookstore>
   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
      <Title>New York Deco</Title>
      <Authors>
         <Author Residence="New York City">
            <First_Name>Richard</First_Name>
            <Last_Name>Berenholtz</Last_Name>
         </Author>
      </Authors>
   </Book>
   <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
      <Remark>
      Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
      </Remark>
      <Title>Five Hundred Buildings of New York</Title>
      <Authors>
         <Author Residence="Beijing">
            <First_Name>Bill</First_Name>
            <Last_Name>Harris</Last_Name>
         </Author>
         <Author Residence="New York City">
            <First_Name>Jorg</First_Name>
            <Last_Name>Brockmann</Last_Name>
         </Author>
      </Authors>
   </Book>
</Bookstore>
"""

In [6]:
from lxml import etree
root = etree.XML(data_string)
print(root.tag,type(root.tag))

Bookstore <class 'str'>


<h4>etree.tostring serializes an XML tree to a byte string</h4>

In [8]:
print(etree.tostring(root, pretty_print=True))

b'<Bookstore>\n   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">\n      <Title>New York Deco</Title>\n      <Authors>\n         <Author Residence="New York City">\n            <First_Name>Richard</First_Name>\n            <Last_Name>Berenholtz</Last_Name>\n         </Author>\n      </Authors>\n   </Book>\n   <Book ISBN="ISBN-13:978-1579128562" Price="15.80">\n      <Remark>\n      Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.\n      </Remark>\n      <Title>Five Hundred Buildings of New York</Title>\n      <Authors>\n         <Author Residence="Beijing">\n            <First_Name>Bill</First_Name>\n            <Last_Name>Harris</Last_Name>\n         </Author>\n         <Author Residence="New York City">\n            <First_Name>Jorg</First_Name>\n            <Last_Name>Brockmann</Last_Name>\n         </Author>\n      </Authors>\n   </Book>\n</Bookstore>\n'


In [31]:
print(etree.tostring(root, pretty_print=True).decode("utf-8"))

<Bookstore>
   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
      <Title>New York Deco</Title>
      <Authors>
         <Author Residence="New York City">
            <First_Name>Richard</First_Name>
            <Last_Name>Berenholtz</Last_Name>
         </Author>
      </Authors>
   </Book>
   <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
      <Remark>
      Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
      </Remark>
      <Title>Five Hundred Buildings of New York</Title>
      <Authors>
         <Author Residence="Beijing">
            <First_Name>Bill</First_Name>
            <Last_Name>Harris</Last_Name>
         </Author>
         <Author Residence="New York City">
            <First_Name>Jorg</First_Name>
            <Last_Name>Brockmann</Last_Name>
         </Author>
      </Authors>
   </Book>
</Bookstore>



<h3>Iterating over an XML tree</h3>
<li>Use an iterator. 
<li>The iterator will generate every tree element for a given subtree

In [32]:
for element in root.iter():
    print(element)

<Element Bookstore at 0x1060a4d88>
<Element Book at 0x1060a4f08>
<Element Title at 0x1060a4fc8>
<Element Authors at 0x10610c108>
<Element Author at 0x1060a4f08>
<Element First_Name at 0x1060a4fc8>
<Element Last_Name at 0x10610c108>
<Element Book at 0x10610c148>
<Element Remark at 0x1060a4f08>
<Element Title at 0x1060a4fc8>
<Element Authors at 0x10610c148>
<Element Author at 0x1060a4f08>
<Element First_Name at 0x1060a4fc8>
<Element Last_Name at 0x10610c148>
<Element Author at 0x10610c108>
<Element First_Name at 0x1060a4f08>
<Element Last_Name at 0x1060a4fc8>


<h4>Or just iterate directly over the children of a subtree</h4>

In [7]:
for child in root:
    for thing in child:
        print(thing)

<Element Title at 0x105555248>
<Element Authors at 0x105555e48>
<Element Remark at 0x105532808>
<Element Title at 0x105555248>
<Element Authors at 0x105555e48>


<h4>Accessing the tag</h4>


In [10]:
for child in root:
    print(child.tag)

Book
Book


<h4>Using the iterator to get specific tags</h4>
<li>In the example below, only the Author tags are accessed
<li>For each Author tag, the .find function accesses the First_Name and Last_Name tags
<li>The .find function only looks at the children, not other descendants, so be careful!
<li>The .text attribute holds the text in a leaf node

In [11]:
for element in root.iter("Author"):
    #print(element)
    print(element.find('First_Name').text,element.find('Last_Name').text)

Richard Berenholtz
Bill Harris
Jorg Brockmann
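The "children only" behaviour of .find is easy to trip over, so here is a small sketch of it. To keep the example self-contained it uses the standard library's xml.etree.ElementTree (a swap-in for lxml, whose find and iter functions behave similarly for these simple cases) and a shortened copy of the bookstore data.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore data, with one book and one author
xml_text = """<Bookstore>
   <Book>
      <Title>New York Deco</Title>
      <Authors>
         <Author Residence="New York City">
            <First_Name>Richard</First_Name>
            <Last_Name>Berenholtz</Last_Name>
         </Author>
      </Authors>
   </Book>
</Bookstore>"""

root = ET.fromstring(xml_text)
book = root.find("Book")

# First_Name is a great-grandchild of Book, so a plain find does not see it
direct = book.find("First_Name")
print(direct)

# The './/' prefix asks for a search through all descendants instead
deep = book.find(".//First_Name")
print(deep.text)
```

Because a failed find returns None, calling .text on the result without checking would raise an AttributeError.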


In [12]:
for element in root.findall('Book/Authors/Author/First_Name'):
    print(element.text)

Richard
Bill
Jorg


<h4>Problem: Find the last names of all authors in the tree “root” using xpath</h4>

In [9]:
for element in root.findall('Book/Authors/Author/Last_Name'):
    print(element.text)

Berenholtz
Harris
Brockmann


<h4>Using values of attributes as filters</h4>
<li>Example: Find the first name of the author of a book that weighs 1.5 oz

In [16]:
root.find('Book[@Weight="1.5"]/Authors/Author/First_Name').text

'Richard'
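Attribute values can also be read in python and filtered there, which helps when the condition is not a simple equality test. The sketch below again uses the standard library's xml.etree.ElementTree as a stand-in for lxml, with a shortened copy of the bookstore data; the 15.5 price threshold is an arbitrary value chosen for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore data: only the attributes and titles are kept
xml_text = """<Bookstore>
   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5"><Title>New York Deco</Title></Book>
   <Book ISBN="ISBN-13:978-1579128562" Price="15.80"><Title>Five Hundred Buildings of New York</Title></Book>
</Bookstore>"""

root = ET.fromstring(xml_text)

for book in root.findall("Book"):
    # Attribute values are always strings, so convert before comparing numbers
    price = float(book.get("Price"))
    # .get returns None when the attribute is missing (the second book has no Weight)
    weight = book.get("Weight")
    print(book.find("Title").text, price, weight)

# Titles of the books that cost less than the (arbitrary) threshold
cheap_titles = [b.find("Title").text for b in root.findall("Book") if float(b.get("Price")) < 15.5]
print(cheap_titles)
```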

<b>Try This</b>: Print first and last names of all authors who live in New York City
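One possible approach to this exercise is sketched below: an attribute filter on the Author tag, combined with .find for the two name tags. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages, and a copy of the bookstore data with the Remark tag left out.

```python
import xml.etree.ElementTree as ET

# Copy of the bookstore data, without the Remark tag
xml_text = """<Bookstore>
   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
      <Title>New York Deco</Title>
      <Authors>
         <Author Residence="New York City">
            <First_Name>Richard</First_Name>
            <Last_Name>Berenholtz</Last_Name>
         </Author>
      </Authors>
   </Book>
   <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
      <Title>Five Hundred Buildings of New York</Title>
      <Authors>
         <Author Residence="Beijing">
            <First_Name>Bill</First_Name>
            <Last_Name>Harris</Last_Name>
         </Author>
         <Author Residence="New York City">
            <First_Name>Jorg</First_Name>
            <Last_Name>Brockmann</Last_Name>
         </Author>
      </Authors>
   </Book>
</Bookstore>"""

root = ET.fromstring(xml_text)

# './/Author' searches all descendants; the [@Residence=...] part keeps only matching authors
nyc_authors = []
for author in root.findall('.//Author[@Residence="New York City"]'):
    nyc_authors.append((author.find("First_Name").text, author.find("Last_Name").text))

for first, last in nyc_authors:
    print(first, last)
```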