# APIs and Databases
## A very superficial intro

In this notebook we will explore how we can get data using APIs as well as where/how this data can be stored and processed.

We will look at the following technologies:

- The requests library to work with REST APIs
- JSON, the most commonly used format for *unstructured* data (not to be confused with unstructured data in the sense of text or images)
- MongoDB, a popular NoSQL database that natively handles JSON-type data (we will be using mLab in the cloud rather than a local installation)

Finally, we will also have a look at SQL. For better or worse, SQL-type databases are still around and will be around for the foreseeable future. Therefore, we need to cover some basics.

**For this tutorial we will need access to a MongoDB instance**

If you would like to run it locally, you can install the free [MongoDB community edition on your machine](https://docs.mongodb.com/manual/administration/install-community/).

However, it is much easier and faster (for now) to use a hosted online version. You can get 500 MB of free space to play around with from mLab: https://mlab.com/

In [None]:
import requests as rq
import json
import time, random

In [None]:
import pandas as pd

In [None]:
from pymongo import MongoClient
# c = MongoClient() if you connect locally

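# Please enter your own credentials in the fields below.
# Database.authenticate() was removed in PyMongo 4, so the credentials are passed to MongoClient instead.
connection = MongoClient('ds149672.mlab.com', 49672,
                         username='sds', password='sdsaau2018',
                         authSource='sds-teaching')
db = connection['sds-teaching']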

### MongoDB
As mentioned, MongoDB is a NoSQL database, which means that it stores data in a hierarchical, document-oriented structure rather than in tables. It stores BSON (binary JSON) as so-called documents, within collections, within databases.
Why is that great?

There is no schema, so you can basically drop arbitrary JSON chunks into a MongoDB collection.

![](https://docs.mongodb.com/manual/_images/data-model-denormalized.bakedsvg.svg)

JSON data maps closely onto Python dictionaries: it is a collection of key-value pairs whose values can in turn be nested dictionaries and/or lists.
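A minimal sketch of that correspondence, using only the standard library. The document below is made up for illustration; its field names are assumptions and are not guaranteed to match what the Nomadlist API returns.

```python
import json

# A made-up JSON document with a nested object and a nested array
raw = (
    '{"username": "alice", '
    '"location": {"now": {"country": "Indonesia"}}, '
    '"trips": [{"place": "Bali"}, {"place": "Lisbon"}]}'
)

doc = json.loads(raw)                         # JSON text -> Python dict
country = doc["location"]["now"]["country"]   # nested objects become nested dicts
places = [t["place"] for t in doc["trips"]]   # JSON arrays become Python lists

print(type(doc).__name__)
print(country, places)

round_trip = json.dumps(doc)                  # Python dict -> JSON text again
```

Because the parsed result is an ordinary dictionary, it can be handed to `insert_one` without any further conversion.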

### Requests

The requests library allows us to interact with APIs by making GET or POST calls. Every time you post something on e.g. Facebook, your phone makes a POST request to a Facebook API endpoint, sending the text or picture along with some metadata. When obtaining data we mostly use GET requests (which is kind of logical). In fact, we can use the requests GET with any kind of URL and will receive whatever is hiding behind that URL (usually some HTML output), sent back by the server.

Note that repeated requests are heavy on servers and generate traffic. People running pages do not like that. Therefore, be nice and build some sleep timers into your loops when running many requests against a page. OR THEY'LL BAN YOU!!!
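A minimal sketch of such a sleep timer, written so that it can be tried without touching the network. The helper name `polite_fetch_all` and its parameters are hypothetical; in the notebook, `fetch` could simply be `rq.get`.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_pause=0.5, max_pause=1.0, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing a random interval between calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one
            sleep(random.uniform(min_pause, max_pause))
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function, so nothing is sent over the network
pauses = []
pages = polite_fetch_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: url.rsplit("/", 1)[-1],  # pretend the "response" is the last path part
    sleep=pauses.append,                       # record the pauses instead of really sleeping
)
print(pages)
print(len(pauses))
```

Passing `sleep` in as a parameter is only a convenience for experimenting; with the default `time.sleep` the loop really waits between calls.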

In [None]:
response = rq.get('https://nomadlist.com/@trevorgerhardt.json')

In [None]:
response.content

In [None]:
response_json = json.loads(response.content)

In [None]:
type(response_json)

In [None]:
response_json.keys()

Let's bring our data into MongoDB

Most important commands for you:


```
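collection = db.collection                # select a collection (created on first insert)

collection.insert_one(some_dict)          # insert a single dictionary
collection.insert_many(list_of_dicts)     # insert several; for a DataFrame use df.to_dict('records')

collection.count_documents({})            # number of documents matching the (empty) filter

collection.find_one()                     # first matching document

cursor = collection.find()                # iterator over all matching documents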

```

In [None]:
# We'll create a new collection
people1 = db.people1

In [None]:
# And put in the parsed JSON
people1.insert_one(response_json)

In [None]:
# Is it in there?

people1.find_one()

Let's get some more data in and automate the "harvesting". We can, for example, extract the list of all followers of our initial person. It turns out that their uuids can also be used in the Nomadlist API.

In [None]:
# Let's make a list of ids of the people whose profiles we would like to fetch from the API
harvestlist = response_json['followers'].keys()

In [None]:
# An API-friendly loop to extract the data for our 40 people

for i in harvestlist:
    q = 'https://nomadlist.com/'+str(i)+'.json' # constructs the query for the GET call
    res = rq.get(q) # grab the data from the API
    if res.status_code in [502,404]: # safety measure: skip this id if the server returns an error
        continue
    people1.insert_one(json.loads(res.content)) # put the data into the DB
    time.sleep(random.uniform(0.5,1)) # pause between 0.5 and 1 sec, a primitive simulation of human behaviour

In [None]:
people1.count_documents({}) # did it work? (Collection.count() was removed in PyMongo 4)

In [None]:
cursor = people1.find() # Now we have the data we can take it out

In [None]:
cursor.next()

In [None]:
# We can be a bit more selective
cursor = people1.find({'location.now.country':'Indonesia'},{'_id':0,'username':1,'stats':1}) 

As you can see, query construction in MongoDB is very different from what you have seen in Python or R, or what you will find in SQL. Everything is written as nested {} documents, which is not especially readable. To some extent that is because MongoDB queries are mostly written by machines for machines. It is something you will have to learn (and/or look up) if you want to work with MongoDB.
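To make the two arguments of `find` less opaque, here is a rough pure-Python imitation of what a dotted-key filter and a field projection do. This is only an illustration under simplifying assumptions (equality tests on nested dictionaries only); it is not how MongoDB is implemented, and the helper names `get_path` and `find` as well as the sample documents are made up.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as 'location.now.country' through nested dicts."""
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # simplification: a missing path is treated as None
        current = current[part]
    return current


def find(docs, query, fields):
    """Keep documents whose dotted keys equal the query values; return only `fields`."""
    for doc in docs:
        if all(get_path(doc, key) == value for key, value in query.items()):
            yield {name: doc[name] for name in fields if name in doc}


people = [
    {"username": "alice", "stats": {"cities": 3}, "location": {"now": {"country": "Indonesia"}}},
    {"username": "bob", "stats": {"cities": 7}, "location": {"now": {"country": "Portugal"}}},
]

result = list(find(people, {"location.now.country": "Indonesia"}, ["username", "stats"]))
print(result)  # [{'username': 'alice', 'stats': {'cities': 3}}]
```

The first dictionary plays the role of the filter and the list of field names plays the role of the projection in the real query.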

In [None]:
# Creating a pandas DF from a Mongo cursor is however not difficult.
indonesia_df = pd.DataFrame(list(cursor))

In [None]:
# We can also unpack nested dictionaries (here the stats column)
# (Series.iteritems() was removed in pandas 2.0, so we build the frame from the list of dicts)
pd.DataFrame(list(indonesia_df['stats']))

MongoDB has many built-in functions for working with "big data". Why not do this inside pandas? A database handles data on disk rather than in memory, indexes fields for fast access, and much more.

One really useful but unfortunately complex feature (I have to look it up every time I use it) is the aggregation of nested elements.

MongoDB works with so-called aggregation pipelines, which have a rather demanding syntax :-/

In the following we will try to unpack, or "unwind", the trips that are nested within every user document. Why would we do that? Because we would like to analyse travel behavior on the micro level (individual trips).
Want to know more? https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/

Below we will create one of these pipelines combining match (a filtering function), project (for selecting what should be returned) and unwind (for disaggregation of nested arrays).
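Before looking at the real pipeline, it may help to see what the unwind step does to the data. The sketch below imitates it in plain Python on made-up documents; the function `unwind` is a hypothetical stand-in, not part of PyMongo, and it only covers the simple case where the field holds a list.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of the list stored under `field`."""
    for doc in docs:
        for element in doc.get(field, []):
            flat = dict(doc)        # shallow copy so the input is left unchanged
            flat[field] = element   # replace the list with a single element
            yield flat


users = [
    {"username": "alice", "trips": [{"place": "Bali"}, {"place": "Lisbon"}]},
    {"username": "bob", "trips": [{"place": "Berlin"}]},
    {"username": "carol", "trips": []},   # no trips, so no output rows
]

rows = list(unwind(users, "trips"))
for row in rows:
    print(row["username"], row["trips"]["place"])
```

Three user documents turn into three trip-level rows here: two for alice, one for bob, and none for carol, which mirrors the default behaviour of `$unwind` for empty arrays.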

In [None]:
# Return only trips of people that are in Indonesia at the moment (strange query but not wrong)
cursor = people1.aggregate([{'$match':{'location.now.country':'Indonesia'}},
                            {'$project':{'_id':0,'username':1,'trips':1}},
                            {'$unwind':'$trips'}])

In [None]:
# Or just return all trips
cursor = people1.aggregate([{'$project':{'_id':0,'username':1,'trips':1}},
                            {'$unwind':'$trips'}])

In [None]:
len(list(cursor))

Unfortunately we cannot pass this directly to pandas and will have to unpack a bit using a simple loop

In [None]:
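# Unpack the returned documents: take the "trips" sub-document and add the username.
# The aggregation is re-run here because list(cursor) above already consumed the cursor.
cursor = people1.aggregate([{'$project':{'_id':0,'username':1,'trips':1}},
                            {'$unwind':'$trips'}])
trips = []
for doc in cursor:  # iterating stops cleanly when the cursor is exhausted
    trip = doc['trips']
    trip['username'] = doc['username']
    trips.append(trip)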

In [None]:
# Now we can create a dataframe
trips_df = pd.DataFrame(trips)

In [None]:
trips_df.info()

In [None]:
trips_df.columns

### Moving on to SQL

We will be using SQLite, a very simple SQL database (often used in mobile devices). It is not as powerful as PostgreSQL or MySQL, but it is easier to work with.

In [None]:
# First we need to import the sqlite driver
import sqlite3

In [None]:
# Establish a connection and create a DB file on disk
db = sqlite3.connect('db_training.db', check_same_thread=False)

In [None]:
# We can actually write directly from Pandas to SQL

trips_df[['country', 'country_code', 'country_slug', 'date_end', 'date_start',
       'epoch_end', 'epoch_start', 'latitude', 'length', 'longitude', 'place', 'place_photo', 'place_slug', 'place_url',
       'user_photo', 'username']].to_sql('trips', db)

In [None]:
# Let's read a bit manually
# First we find out which tables we can see in the connected DB
cursor = db.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
cursor.fetchall()

In [None]:
# Find trips to Indonesia

cursor = db.cursor()
cursor.execute("""SELECT * FROM trips where country = 'Indonesia'""")
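When the country comes from a variable rather than being typed into the query, a `?` placeholder is usually preferable to building the SQL string by hand. The sketch below shows this on a tiny in-memory database; the table contents are made up, and the names `conn` and `cur` are chosen only to avoid clashing with the notebook's own variables.

```python
import sqlite3

# In-memory database with a tiny made-up trips table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (username TEXT, country TEXT, place TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("alice", "Indonesia", "Bali"),
        ("alice", "Portugal", "Lisbon"),
        ("bob", "Indonesia", "Ubud"),
    ],
)
conn.commit()

country = "Indonesia"
# The ? placeholder lets sqlite3 bind the value instead of pasting it into the SQL text
cur = conn.execute(
    "SELECT username, place FROM trips WHERE country = ? ORDER BY place",
    (country,),
)
column_names = [d[0] for d in cur.description]  # column names of the result set
matches = cur.fetchall()                        # list of row tuples
print(column_names)
print(matches)
conn.close()
```

The same `(country,)` tuple can be passed to `pd.read_sql_query` through its `params` argument.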

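In [None]:
# fetchall() empties the cursor, so we keep the rows in a variable for reuse
rows = cursor.fetchall()
len(rows)

In [None]:
# We can pass the fetched rows directly to Pandas (similar to MongoDB)
indonesia_df1 = pd.DataFrame(rows, columns=[d[0] for d in cursor.description])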

In [None]:
indonesia_df1

In [None]:
# We can also ask pandas to perform the query for us
indonesia_df2 = pd.read_sql_query("""SELECT * FROM trips where country = 'Indonesia'""", db)

In [None]:
indonesia_df2

- If you want to learn more about working with SQL: https://www.dataquest.io/blog/python-pandas-databases/
- There is also a great intro course on Datacamp: https://www.datacamp.com/courses/intro-to-sql-for-data-science

In [None]:
# Get a cursor object
cursor = db.cursor()
cursor.execute('''
    CREATE TABLE trips_mapping('index' INTEGER PRIMARY KEY, place_slug TEXT)
''')

In [None]:
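# Use the frame read back from SQL (indonesia_df2), which has the 'index' and 'place_slug' columns
for _, row in indonesia_df2.iterrows():
    cursor.execute('''INSERT INTO trips_mapping('index', place_slug) VALUES(?,?)''',
                   (int(row['index']), row['place_slug']))
db.commit()  # persist the inserted rows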

In [None]:
pd.read_sql_query("""SELECT * FROM trips_mapping""", db)

In [None]:
# Close DB when finished
db.close()