# Getting Data - II

In this section, you will learn to:
- Get data from APIs
    - Use the ```requests``` module to connect to a URL and fetch a response
    - Use ```json.loads()``` to convert a JSON object to a python dictionary
- Read PDF files in python using ```PyPDF2```


### Getting Data from APIs

APIs, or application programming interfaces, are created by companies and organisations to provide restricted access to data. It is very common to get data from APIs for data analysis, for example, you can get financial data (stock prices etc.), social media data (Facebook, Twitter etc. provide APIs), weather data, data about healthcare, music, food and drinks, and from almost every domain. 


Apart from being rich sources of data, there are other reasons to use APIs:
- When the data is being updated in real time. If you use downloaded CSV files, you'll have to download data manually, and update your analysis multiple times. Through APIs, you can automate the process of getting real-time data.
- Easy access to structured and verified data - though you can scrape websites, APIs can directly provide data in structured format, and is of better quality
- Access to restricted data: You cannot scrape all websites easily, and that's often illegal (e.g. Facebook, financial data etc.). APIs are the only way to get this data.

There are many more reasons depending on the use cases and the domain of application.

A list of useful APIs is available here: https://github.com/toddmotto/public-apis

#### Example Use Case: Google Maps Geocoding API

Google Maps provides many APIs, one of which is the <a href="https://developers.google.com/maps/documentation/geocoding/start?authuser=1">Google Maps Geocoding API</a>. You can use it to geocode addresses, i.e. get the latitude-longitude coordinates, and vice-versa. 
    
To use the API, go to <a href="https://developers.google.com/maps/">Google Developers</a>, get an API key, and go to the Geocoding API page.


Once you have an API key, getting the geocoded data of an address is easy. For e.g., if you want to geocode the address "UpGrad, Nishuvi building, Anne Besant Road, Worli, Mumbai", you need to separate the words using a "+", and provide the address and your API key in this format:

https://maps.googleapis.com/maps/api/geocode/json?address=UpGrad,+Nishuvi+building,+Anne+Besant+Road,+Worli,+Mumbai&key=YOUR_API_KEY


Thus, this is a two step process:
- Join the words in the address by a plus and convert it to a form ```words+in+the+address``` 
- Connect to the URL by appending the address and the API key
- Get a response from the API and convert it to a python object (here, a dictionary)


In [12]:
import numpy as np
import pandas as pd

# Need requests to connect to the URL, json to convert JSON to dict
import requests, json
import pprint

# joining words in the address by a "+"
add = "UpGrad, Nishuvi building, Anne Besant Road, Worli, Mumbai"
split_address = add.split(" ")
address = "+".join(split_address)
print(address)



UpGrad,+Nishuvi+building,+Anne+Besant+Road,+Worli,+Mumbai


Now, we can connect to the Google Maps URL using the api key and the address and get a response. Like most APIs, Google Maps returns the geocoded data in a JSON format, which is similar to a python dict.

As seen in the earlier section, we use the ```requests.get(url)``` method to get data from a URL. 

In [13]:
api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"

url = "https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}".format(address, api_key)
r = requests.get(url)

# The r.text attribute contains the text in the response object
print(type(r.text))
print(r.text)

<class 'str'>
{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "75",
               "short_name" : "75",
               "types" : [ "street_number" ]
            },
            {
               "long_name" : "Doctor Annie Besant Road",
               "short_name" : "Dr Annie Besant Rd",
               "types" : [ "route" ]
            },
            {
               "long_name" : "Bhim Nagar",
               "short_name" : "Bhim Nagar",
               "types" : [ "political", "sublocality", "sublocality_level_2" ]
            },
            {
               "long_name" : "Worli",
               "short_name" : "Worli",
               "types" : [ "political", "sublocality", "sublocality_level_1" ]
            },
            {
               "long_name" : "Mumbai",
               "short_name" : "Mumbai",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Mumbai",
           

The dict-like structure that you see above is a JSON object, and is the most common way of exchanging data through APIs. We can easily convert the JSON object to a python dict using ```json.loads(json_object)```.

Notice that the JSON object contains various details of the address - the components of the address, the full address, the latitude and the longitude, PIN code, etc. 

Let's convert the JSON to a dictionary, so that we can work with it easily.

In [14]:
# converting the json object to a dict using json.loads()
r_dict = json.loads(r.text)

# the pretty printing library pprint makes it easy to read large dictionaries
pprint.pprint(r_dict)

{'results': [{'address_components': [{'long_name': '75',
                                      'short_name': '75',
                                      'types': ['street_number']},
                                     {'long_name': 'Doctor Annie Besant Road',
                                      'short_name': 'Dr Annie Besant Rd',
                                      'types': ['route']},
                                     {'long_name': 'Bhim Nagar',
                                      'short_name': 'Bhim Nagar',
                                      'types': ['political',
                                                'sublocality',
                                                'sublocality_level_2']},
                                     {'long_name': 'Worli',
                                      'short_name': 'Worli',
                                      'types': ['political',
                                                'sublocality',
                                 

In [15]:
# The dict has two main keys - status and results
r_dict.keys()

dict_keys(['status', 'results'])

The ```r_dict['results']``` contains a list of various attributes.

In [16]:
pprint.pprint(r_dict['results'])

[{'address_components': [{'long_name': '75',
                          'short_name': '75',
                          'types': ['street_number']},
                         {'long_name': 'Doctor Annie Besant Road',
                          'short_name': 'Dr Annie Besant Rd',
                          'types': ['route']},
                         {'long_name': 'Bhim Nagar',
                          'short_name': 'Bhim Nagar',
                          'types': ['political',
                                    'sublocality',
                                    'sublocality_level_2']},
                         {'long_name': 'Worli',
                          'short_name': 'Worli',
                          'types': ['political',
                                    'sublocality',
                                    'sublocality_level_1']},
                         {'long_name': 'Mumbai',
                          'short_name': 'Mumbai',
                          'types': ['locality', 'poli

On closer inspection, you'll see that the latitude is contained in ```r_dict['results'][0]['geometry']['location']['lat']``` and the longitude in ```r_dict['results'][0]['geometry']['location']['lng']```.

In [17]:
lat = r_dict['results'][0]['geometry']['location']['lat']
lng = r_dict['results'][0]['geometry']['location']['lng']

print((lat, lng))

(18.9947946, 72.81638699999999)


To summarise, the procedure for getting lat-long coordinates from an address is as follows:
- Convert the address to a suitable format and connect to the Google Maps URL using your key
- Get a response from the API and convert it into a dict using ```json.loads(r.text)```
- Get the lat-long corrdinates using ```lat = r_dict['results'][0]['geometry']['location']['lat']``` and analogous for longitude

**Writing a Function for this Procedure**

Since you may need to do this multiple times, let's write a function which takes in a user-defined address, converts it into a suitable format, and returns the lat-long coordinates as a tuple.



In [18]:
# Input to the fn: Address in standard human-readable form
# Output: Tuple (lat, lng)

api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"

def address_to_latlong(address):
    # convert address to the form x+y+z
    split_address = address.split(" ")
    address = "+".join(split_address)
    
    # pass the address to the URL
    url = "https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}".format(address, api_key)
    
    # connect to the URL, get response and convert to dict
    r = requests.get(url)
    r_dict = json.loads(r.text)
    lat = r_dict['results'][0]['geometry']['location']['lat']
    lng = r_dict['results'][0]['geometry']['location']['lng']
    
    return (lat, lng)
    

# getting some coordinates
print(address_to_latlong("UpGrad, Nishuvi Building, Worli, Mumbai"))
print(address_to_latlong("IIIT Bangalore, Electronic City, Bangalore"))


(18.9947946, 72.81638699999999)
(12.8447512, 77.6632317)


Now, what can be a practical use case of using a geocoding API in data analysis? 

Say you are working in an ecommerce retail company, and you have a dataframe containing a list of customer addresses. Your logistics team wants to identify clusters of customers staying close by, so that they can plan the deliveries accordingly.

We have taken some real addresses an examples below. They are stored in a dataframe, and you want to add a column containing the (lat, lng) of each address. 


In [19]:
# Importing addresses file
add = pd.read_csv("addresses.txt", sep="\t", header = None)
add.head()


Unnamed: 0,0
0,"777 Brockton Avenue, Abington MA 2351"
1,"30 Memorial Drive, Avon MA 2322"
2,"250 Hartford Avenue, Bellingham MA 2019"
3,"700 Oak Street, Brockton MA 2301"
4,"66-4 Parkhurst Rd, Chelmsford MA 1824"


In [20]:
# renaming the column
add = add.rename(columns={0:'address'})
add.head()

Unnamed: 0,address
0,"777 Brockton Avenue, Abington MA 2351"
1,"30 Memorial Drive, Avon MA 2322"
2,"250 Hartford Avenue, Bellingham MA 2019"
3,"700 Oak Street, Brockton MA 2301"
4,"66-4 Parkhurst Rd, Chelmsford MA 1824"


We can now apply the function ```address_to_latlong()``` to the entire column of the dataframe. Since the function takes a lot of time, we'll only apply the function to the first few rows.

In [21]:
add.head()['address'].apply(address_to_latlong)

0    (42.0962778, -70.96858759999999)
1           (42.1211178, -71.0301073)
2           (42.1162461, -71.4652942)
3            (42.098104, -71.0567428)
4           (42.6166586, -71.3636172)
Name: address, dtype: object

You now have the coordinates of all the addresses which you can store in a new column, and write programs to cluster addresses that are close by together.

### Reading PDF Files in Python

Reading PDF files is not as straightforward as reading text or delimited files, since PDFs often contain images, tables, etc. PDFs are mainly designed to be human-readable, and thus you need special libraries to read them in python (or any other programming language).

Luckily, there are some really good libraries in Python. We will use ```PyPDF2``` to read PDFs in python, since it is easy to use and works with *most* types of PDFs. 

Note that python will only be able to read text from PDFs, not images, tables etc. (though that is possible using other specialised libraries).

You can install ```PyPDF2``` using ```pip install PyPDF2```.


For this illustration, we will read a PDF of the book 'Animal Farm' written by George Orwell. 


In [22]:
import PyPDF2

# reading the pdf file
pdf_object = open('animal_farm.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_object)

# Number of pages in the PDF file
print(pdf_reader.numPages)

# get a certain page's text
page_object = pdf_reader.getPage(5)

# Extract text from the page_object
print(page_object.extractText())

55
Cowsandhorses,geeseandturkeys,
Allmusttoilforfreedom'ssake.
BeastsofEngland,beastsofIreland,
Beastsofeverylandandclime,
Hearkenwellandspreadmytidings
Ofthegoldenfuturetime.
Thesingingofthissongthrewtheanimalsintothewildestexcitement.
AlmostbeforeMajorhadreachedtheend,theyhadbegunsingingitforthem-
selves.Eventhestupidestofthemhadalreadypickedupthetuneandafewof
thewords,andasforthecleverones,suchasthepigsanddogs,theyhadthe
entiresongbyheartwithinafewminutes.Andthen,afterafewpreliminary
tries,thewholefarmburstoutinto
BeastsofEngland
intremendousunison.
Thecowslowedit,thedogswhinedit,thesheepbleatedit,thehorseswhinnied
it,theducksquackedit.Theyweresodelightedwiththesongthattheysang
itrightthroughetimesinsuccession,andmighthavecontinuedsingingitall
nightiftheyhadnotbeeninterrupted.
Unfortunately,theuproarawokeMr.Jones,whosprangoutofbed,making
surethattherewasafoxintheyard.Heseizedthegunwhichalwaysstoodina
cornerofhisbedroom,andletyachargeofnumber6shotintothedarkness.
Thepelletsburiedthem

