# API Authentication: User vs Application
For most APIs, it is important to note the distinction between **authentication types.** There are two types of [authentication calls](https://twython.readthedocs.io/en/latest/usage/starting_out.html) that can be made and they influence the downstream functionality of the module. They are:
* **User Authentication:** user authenticated calls for direct interactions with users (e.g. tweeting, following people, sending DMs, etc.)
* **Application Authentication:** application authenticated calls for making read-only calls to Twitter (e.g. searching, reading a public user's timeline)

Below we will only be handling application authentication and subsequent calls, as this is likely the use case for data analysis (e.g. gather data, store offline, analyze offline).

# Facebook API
To use Facebook's API, you will first need to [register](https://developers.facebook.com/) for an application with Facebook. Once there, create a new app and fill out the requisite details (e.g. name, description, website, etc.).

After registering, you will need your **App ID** and **App Secret**, which will be used to authorize your API calls. Be sure to store them securely.
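The code cells later in this section load credentials from a NumPy archive (e.g. `facebook_oath.npz`). A minimal sketch of how such a file might be created and read back is given below. The helper names `save_credentials` and `load_credentials`, the placeholder values, and the temporary directory are assumptions for illustration only; in practice the file should live somewhere private and outside version control.

```python
import os
import tempfile
import numpy as np

def save_credentials(path, **credentials):
    """Store credentials as named string arrays in a .npz archive."""
    np.savez(path, **credentials)

def load_credentials(path):
    """Load credentials back into a dictionary of plain Python strings."""
    with np.load(path) as archive:
        return {key: archive[key].astype(str).tolist() for key in archive.files}

## Placeholder values (hypothetical); substitute your own App ID and App Secret.
path = os.path.join(tempfile.mkdtemp(), 'facebook_oath.npz')
save_credentials(path, app_id='YOUR_APP_ID', app_secret='YOUR_APP_SECRET')

creds = load_credentials(path)
print(sorted(creds))
```

The same pattern would apply to the Twitter and Reddit credential files used later, with different key names.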

## Searching Facebook
The primary function with an authenticated Facebook Graph API instance is the search function, which returns data related to the search from the Facebook graph structure. Before detailing the search function, it's worth first going over the Facebook Graph API structure.

### Graph API
From the Facebook devs, "the Graph API is the primary way to get data out of, and put data into, Facebook's platform." Though the graph structure has been thoroughly described [elsewhere](https://developers.facebook.com/docs/graph-api/overview), the concise explanation is that the Graph API is structured akin to a social network in that it is comprised of the following:
* **Nodes:** top-level classes, such as Users, Photos, Pages, and Comments
* **Edges:** relationships between those classes, such as a Page's Photos, or a Photo's Comments
* **Fields:** information about individual instances of those classes, such as a User's birthday, or a Page's title

When searching the Facebook Graph API, we identify search-relevant nodes, store their associated data (fields), and identify related content via their connections (edges).
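To make the node/edge/field vocabulary concrete, the sketch below shows how the three pieces map onto the path and query string of a Graph API request. The helper `graph_url` is hypothetical (the wrapper library introduced later builds these requests for you), the page ID is the one that appears in the search output later in this section, and the access token that a real request would also need is omitted.

```python
from urllib.parse import urlencode

GRAPH_URL = 'https://graph.facebook.com'

def graph_url(node_id, edge=None, fields=None, version='2.9'):
    """Build a Graph API URL for a node, optionally one of its edges,
    restricted to the requested fields."""
    path = '/v%s/%s' % (version, node_id)
    if edge is not None:
        path += '/%s' % edge
    query = urlencode({'fields': ','.join(fields)}) if fields else ''
    return GRAPH_URL + path + ('?' + query if query else '')

## A node, restricted to a few of its fields.
node_url = graph_url('547581871944926', fields=['id', 'name', 'about'])
## An edge of that node (the posts belonging to the page).
edge_url = graph_url('547581871944926', edge='posts')
print(node_url)
print(edge_url)
```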

### Search Types
Facebook has [documented](https://developers.facebook.com/docs/graph-api/using-graph-api#search) the list of searchable node types and this list is recreated below:

|Type|Description|q value|
|----|-----------|-------|
|user|Search for a person (if they allow their name to be searched for).|Name|
|page|Search for a page.|Name|
|event|Search for an event.|Name|
|group|Search for a group.|Name|
|place|Search for a place. You can narrow your search to a specific location and distance by adding the center parameter (with latitude and longitude) and an optional distance parameter (in meters).|Name|
|placetopic|Returns a list of possible place Page topics and their IDs. Use with topic_filter=all parameter to get the full list.|None|
|ad_*|A collection of different search options that can be used to find targeting options.|See [Targeting Options](https://developers.facebook.com/docs/graph-api/reference/v2.9/targeting)|

### Function Call 
The [search function](https://developers.facebook.com/docs/graph-api/using-graph-api) accepts the following parameters:

|Name|Required|Description|Example|
|----|--------|-----------|-------|
|type|Yes|The type of Facebook class to be searched|See above|
|q|Yes|The content of the search query| Sam, python, doggos|
|before|No|This is the cursor that points to the start of the page of data that has been returned.|ID|
|after|No|This is the cursor that points to the end of the page of data that has been returned.|ID|
|limit|No|This is the maximum number of objects that may be returned. A query may return fewer than the value of limit due to filtering.|15|
|next|No|The Graph API endpoint that will return the next page of data. If not included, this is the last page of data. Due to how pagination works with visibility and privacy, it is possible that a page may be empty but contain a 'next' paging link. Stop paging when the 'next' link no longer appears.|URL|
|previous|No|The Graph API endpoint that will return the previous page of data. If not included, this is the first page of data.|URL|
|fields|No|Function returns only those fields specified.|id, name, picture|


## Facebook-SDK
From its [documentation page](https://facebook-sdk.readthedocs.io/en/latest/index.html):
>This client library is designed to support the Facebook Graph API and the official Facebook JavaScript SDK, which is the canonical way to implement Facebook authentication. You can read more about the Graph API by accessing its official documentation.

To install the latest version of Facebook-SDK (v3.0 at the time of writing), simply open a Terminal and input: 

\> pip install -e git+https://github.com/mobolic/facebook-sdk.git#egg=facebook-sdk


### Initializing Facebook Instance

In [1]:
import facebook
import numpy as np

## Load and store authentication keys.
oath = np.load('facebook_oath.npz')
APP_ID = oath['app_id'].astype(str).tolist()
APP_SECRET = oath['app_secret'].astype(str).tolist()

## Initialize Facebook graph object to receive ACCESS_TOKEN.
graph = facebook.GraphAPI(version='2.7')

## Store and save ACCESS_TOKEN.
ACCESS_TOKEN = graph.get_app_access_token(APP_ID, APP_SECRET)

## Re-initialize Facebook graph object with ACCESS_TOKEN to use it.
graph = facebook.GraphAPI(access_token=ACCESS_TOKEN, version='2.9')

### Search Querying

In [2]:
## Search for all Facebook pages mentioning python.
results = graph.search(type='page', q='python', limit=5)
results

{'data': [{'id': '547581871944926', 'name': 'Python Developers'},
  {'id': '509724922449953', 'name': 'Python Tips'},
  {'id': '100832693336856', 'name': 'Python'},
  {'id': '346004702246198', 'name': 'Python'},
  {'id': '1539401562941064', 'name': 'Django - Python'}],
 'paging': {'cursors': {'after': 'NAZDZD', 'before': 'MAZDZD'},
  'next': 'https://graph.facebook.com/v2.9/search?access_token=1415649088531508%7Cjq-ce7ouUFY_B2ePp6dQ9ob2_zc&q=python&type=page&limit=5&after=NAZDZD'}}

Query results are stored in a dictionary format. At the top level, there are two keys:
* **data:** a list of dictionaries containing the results of the search.
* **paging:** a dictionary containing information about the query itself, including the **after**, **before**, and **next** pieces of metadata which are necessary for iterative function calls.
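The paging metadata can be used to walk through successive pages of results. The sketch below is one possible way to do so: it follows the **after** cursor until the **next** link disappears, and tolerates empty pages along the way. The function `collect_pages` and the stand-in data are hypothetical, so that the example runs without network access.

```python
def collect_pages(fetch_page, max_pages=10):
    """Follow 'after' cursors until no 'next' link remains or max_pages is hit.

    fetch_page is any callable taking an 'after' cursor (None for the
    first page) and returning a dictionary with 'data' and 'paging' keys.
    """
    items, after = [], None
    for _ in range(max_pages):
        page = fetch_page(after)
        items.extend(page.get('data', []))
        paging = page.get('paging', {})
        ## A missing 'next' link marks the last page.
        if 'next' not in paging:
            break
        after = paging['cursors']['after']
    return items

## Stand-in for the API (hypothetical data), keyed by the 'after' cursor.
FAKE_PAGES = {
    None: {'data': [{'id': '1'}, {'id': '2'}],
           'paging': {'cursors': {'after': 'A'}, 'next': 'url'}},
    'A': {'data': [], 'paging': {'cursors': {'after': 'B'}, 'next': 'url'}},
    'B': {'data': [{'id': '3'}], 'paging': {'cursors': {'after': 'C'}}},
}
all_items = collect_pages(FAKE_PAGES.get)
print(all_items)
```

With a live graph object, `fetch_page` could be something like `lambda after: graph.search(type='page', q='python', limit=5, after=after)`, assuming the wrapper forwards extra keyword arguments to the API.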

Once a particular node in the Graph API has been identified, its ID can be used to query additional features of the node, including its metadata and connections. We will only highlight a few examples below, but there is [extensive documentation](https://developers.facebook.com/docs/graph-api/reference) of the fields and connections that can be searched for a particular node type.

In [3]:
## Extract first page and corresponding ID.
page = results['data'][0]
page_id = page['id']

## Extract additional metadata from page.
metadata = graph.get_object(page_id, fields='id,name,about,link')
print('Page Name: %s' %metadata['name'])
print('Page ID: %s' %metadata['id'])
print('Page Link: %s' %metadata['link'])
print('Page Description: %s' %metadata['about'])

## Extract posts from this page.
post_results = graph.get_connections(page_id, connection_name='posts', limit=3)
post_results['data']

Page Name: Python Developers
Page ID: 547581871944926
Page Link: https://www.facebook.com/PythonDevelopers/
Page Description: Everything about python

For beginners, try this book http://greenteapress.com/thinkpython2/thinkpython2.pdf


[{'created_time': '2017-04-16T04:30:09+0000',
  'id': '547581871944926_1286904234679349',
  'story': "Python Developers shared Code.org's post."},
 {'created_time': '2017-04-05T23:21:15+0000',
  'id': '547581871944926_1277468192289620',
  'message': 'Cool code for running python with grub bootloader :)\nhttps://github.com/biosbits/bits'},
 {'created_time': '2017-03-17T13:13:29+0000',
  'id': '547581871944926_1260813420621764',
  'message': 'Firebase + Python https://github.com/thisbejim/Pyrebase'}]

### Aggregating Search Results
The list of dictionaries can very easily be amalgamated into a single Pandas DataFrame.

In [4]:
from pandas import Series, DataFrame

## Convert each status dictionary into a Pandas Series.
posts = [Series(post) for post in post_results['data']]

## Merge into single DataFrame
df = DataFrame(posts)
df.head()

| |created_time|id|message|story|
|-|------------|--|-------|-----|
|0|2017-04-16T04:30:09+0000|547581871944926_1286904234679349||Python Developers shared Code.org's post.|
|1|2017-04-05T23:21:15+0000|547581871944926_1277468192289620|Cool code for running python with grub bootloa...||
|2|2017-03-17T13:13:29+0000|547581871944926_1260813420621764|Firebase + Python https://github.com/thisbejim...||


### Putting it all together
Due to the graph-like nature of Facebook, data organization/storage is a non-trivial design problem and should reflect the ultimate analytic goals. For example, if one is interested only in text mining the comments of a Facebook page without caring about the structure of post-comment and user-comment relationships, then a flat structure may be appropriate. If more complex tree-like relations are desired, then nested data structures (e.g. nested file directories, JSON, XML, etc.) may be necessary.
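The difference between the two storage strategies can be illustrated with a small, made-up example. The posts and comments below are hypothetical and the helper `flatten_comments` is only a sketch: the nested form keeps each comment under its post, while the flat form keeps one row per comment with just a reference back to the post.

```python
import json

## Hypothetical posts and comments, nested as the graph relates them.
nested = [
    {'id': 'post_1', 'message': 'First post',
     'comments': [{'id': 'c1', 'message': 'Nice'},
                  {'id': 'c2', 'message': 'Agreed'}]},
    {'id': 'post_2', 'message': 'Second post', 'comments': []},
]

def flatten_comments(posts):
    """Return one row per comment, keeping only a reference to its post."""
    return [{'post_id': post['id'], 'comment_id': c['id'], 'message': c['message']}
            for post in posts for c in post['comments']]

## Flat: convenient for text mining.
flat = flatten_comments(nested)
print(flat)

## Nested: preserves the post-comment structure (here serialized as JSON).
serialized = json.dumps(nested, indent=2)
print(serialized)
```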

In any event, an example script is given below that mines comments from Barack Obama's Facebook Page. Because it makes only a handful of requests, the script stays well within Facebook's [rate limit](https://developers.facebook.com/docs/graph-api/advanced/rate-limiting) of 200 requests/hr.

In [5]:
import facebook
import numpy as np
from pandas import Series, DataFrame

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Authentication.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Load and store authentication keys.
oath = np.load('facebook_oath.npz')
APP_ID = oath['app_id'].astype(str).tolist()
APP_SECRET = oath['app_secret'].astype(str).tolist()

## Initialize Facebook graph object to receive ACCESS_TOKEN.
graph = facebook.GraphAPI(version='2.7')

## Store and save ACCESS_TOKEN.
ACCESS_TOKEN = graph.get_app_access_token(APP_ID, APP_SECRET)

## Re-initialize Facebook graph object with ACCESS_TOKEN to use it.
graph = facebook.GraphAPI(access_token=ACCESS_TOKEN, version='2.9')

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Find top pages.
results = graph.search(type='page', q='Barack Obama', limit=5)
top_result = results['data'][0]
print('Top results: %s (ID = %s)' %(top_result['name'], top_result['id']))

## Request page metadata.
metadata = graph.get_object(top_result['id'], fields='id,name,about,link')

## Request five most recent posts.
post_results = graph.get_connections(top_result['id'], connection_name='posts', limit=5)

## Iteratively request and store the first 
## five comments from each post.
df = []
for post in post_results['data']:
    
    ## Request comments.
    comment_results = graph.get_connections(post['id'], connection_name='comments', limit=5,
                                            fields='id,created_time,message,like_count')
    
    ## Iteratively convert comments to Series. Store.
    comments = [Series(comment) for comment in comment_results['data']]
    df += comments
    
## Convert into DataFrame.
df = DataFrame(df)
df.head(5)

Top results: Barack Obama (ID = 6815841748)


| |created_time|id|like_count|message|
|-|------------|--|----------|-------|
|0|2017-01-10T23:49:31+0000|10154508876046749_10154508879381749|13088|As a young Republican. We don't always see eye...|
|1|2017-01-10T23:51:55+0000|10154508876046749_10154508884101749|5875|Even tho I dont agree with everything presiden...|
|2|2017-01-10T23:49:24+0000|10154508876046749_10154508879121749|4647|So sad, especially considering who is replacin...|
|3|2017-01-10T23:50:14+0000|10154508876046749_10154508880516749|3186|I'm gonna miss you Barack Obama I wish you cou...|
|4|2017-01-10T23:51:03+0000|10154508876046749_10154508882091749|3124|I am so very sorry for how Congress treated yo...|


# Twitter API
To use Twitter's API, you will first need to [register](https://apps.twitter.com/) for an application with Twitter. Once there, create a new app and fill out the requisite details (e.g. name, description, website, etc.).

After registering, you will need your **Consumer Key** and **Consumer Secret**, which will be used to authorize your API calls. Be sure to store them securely.

## Searching Twitter
The primary function with an OAuth2-authenticated Twitter instance is the search function, which returns relevant Tweets based on a query. There is substantial documentation by Twitter on this function, including:
* Search function [docstring](https://dev.twitter.com/rest/reference/get/search/tweets)
* Search function [syntax](https://dev.twitter.com/rest/public/search)
* Search function [best uses](https://dev.twitter.com/rest/public/timelines)
* Search function [rate limits](https://dev.twitter.com/rest/public/rate-limiting)

### Function Call 
The search function accepts the following parameters:

|Name|Required|Description|Example|
|----|--------|-----------|-------|
|q|Yes|A UTF-8, URL-encoded search query of 500 characters maximum, including operators. Queries may additionally be limited by complexity.| @szorowi1, python, #qmss|
|geocode|No|Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API, but will fall back to the user's Twitter profile. The parameter value is specified as "latitude,longitude,radius", where radius units must be specified as either "mi" (miles) or "km" (kilometers).|40.81207,-73.954377,1mi|
|lang|No|Restricts tweets to the given language, given by an [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) code.|eu, en|
|result_type|No|Specifies what type of search results you would prefer to receive.|recent, popular, mixed (default)|
|count|No|The number of tweets to return per page, up to a maximum of 100.|15 (default)|
|until|No|Returns tweets created before the given date. Date should be formatted as YYYY-MM-DD. Keep in mind that the search index has a **7-day limit**: no tweets will be found for a date older than one week.|2015-07-19|
|since_id|No|Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.|12345|
|max_id|No|Returns results with an ID less than (that is, older than) or equal to the specified ID.|54321|

### Search Range
The **max_id** and **since_id** flags are important due to the [dynamic nature](https://dev.twitter.com/rest/public/timelines) of Twitter posts: new posts are constantly being added, so the set of results shifts between queries. Without properly specifying flags demarcating the start and stop points for searching, the risk of storing duplicate Tweets is high. To combat this when making multiple search queries, the lowest ID among the Tweets already stored (i.e. the ID of the oldest stored Tweet) **minus 1** should be passed as the max_id flag for the next query. In this way, the next batch of queried Tweets will begin where the previous batch left off. The **since_id** flag can similarly be used to only add Tweets newer than the newest Tweet previously collected. For more details and a more complete explanation, see [here](https://dev.twitter.com/rest/public/timelines).
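The max_id bookkeeping described above can be sketched without touching the network. In the example below, `fake_search` is a hypothetical stand-in for the real search call (it returns the newest Tweets first and honors max_id), and the integer IDs are made up; only the paging logic in `collect` is the point.

```python
def fake_search(tweets, count=3, max_id=None):
    """Stand-in for a search call: newest Tweets first, honoring max_id."""
    matches = [t for t in tweets if max_id is None or t['id'] <= max_id]
    matches.sort(key=lambda t: t['id'], reverse=True)
    return {'statuses': matches[:count]}

def collect(tweets, n_queries=5):
    """Page backwards through the results without storing duplicates."""
    stored, max_id = [], None
    for _ in range(n_queries):
        results = fake_search(tweets, max_id=max_id)
        if not results['statuses']:
            break
        stored.extend(results['statuses'])
        ## The next query starts just below the oldest Tweet stored so far.
        max_id = min(t['id'] for t in results['statuses']) - 1
    return stored

## Hypothetical timeline with Tweet IDs 1 through 8.
timeline = [{'id': i} for i in range(1, 9)]
ids = [t['id'] for t in collect(timeline)]
print(ids)
```

Each ID appears exactly once, because every query is capped at one below the oldest ID already stored.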

### Syntax
Twitter has conveniently [documented](https://dev.twitter.com/rest/public/search) the syntax of search queries. Some examples are given below:

|Operator|Finds Tweets|
|--------|------------|
|watching now|containing both “watching” and “now”. This is the default operator.|
|“happy hour”|containing the exact phrase “happy hour”.|
|love OR hate|containing either “love” or “hate” (or both).|
|beer -root|containing “beer” but not “root”.|
|#haiku|containing the hashtag “haiku”.|
|from:interior|sent from Twitter account “interior”.|
|to:NASA|a Tweet authored in reply to Twitter account “NASA”.|
|@NASA|mentioning Twitter account “NASA”.|
|puppy filter:media|containing “puppy” and an image or video.|
|puppy -filter:retweets|containing “puppy”, filtering out retweets.|
|hilarious filter:links|containing “hilarious” and linking to URL.|
|puppy url:amazon|containing “puppy” and a URL with the word “amazon” anywhere within it.|
|superhero since:2015-12-21|containing “superhero” and sent since date “2015-12-21” (year-month-day).|
|puppy until:2015-12-21|containing “puppy” and sent before the date “2015-12-21”.|
|movie -scary :)|containing “movie”, but not “scary”, and with a positive attitude.|
|flight :(|containing “flight” and with a negative attitude.|
|traffic ?|containing “traffic” and asking a question.|
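Since a query is just a space-separated string of operators, queries can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any Twitter library, and covers only a few of the operators from the table.

```python
def build_query(terms=(), phrase=None, exclude=(), from_user=None, no_retweets=False):
    """Combine a few of the search operators into one query string."""
    parts = list(terms)
    if phrase:
        parts.append('"%s"' % phrase)          # exact phrase
    parts += ['-%s' % word for word in exclude]  # excluded words
    if from_user:
        parts.append('from:%s' % from_user)    # sent from an account
    if no_retweets:
        parts.append('-filter:retweets')       # drop retweets
    return ' '.join(parts)

q1 = build_query(terms=['beer'], exclude=['root'])
q2 = build_query(from_user='columbia', no_retweets=True)
q3 = build_query(phrase='happy hour')
print(q1)
print(q2)
print(q3)
```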

### URL Encoded Queries
Twitter requests that all search queries be [URL-encoded](https://en.wikipedia.org/wiki/Percent-encoding). Fortunately, this is not difficult with the **urllib** Python library.

\> import urllib.parse <br> \> urllib.parse.quote_plus("query")
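As a quick illustration of what the encoding does, the sketch below encodes two of the queries used in this section and decodes them again. Reserved characters such as `#` and `:` become percent-escapes and spaces become `+`.

```python
from urllib.parse import quote_plus, unquote_plus

queries = ['#python :)', 'from:columbia -filter:retweets']

## Map each raw query to its URL-encoded form.
encoded = {query: quote_plus(query) for query in queries}
for raw, enc in encoded.items():
    print('%s -> %s' % (raw, enc))

## Decoding recovers the original query.
decoded = [unquote_plus(enc) for enc in encoded.values()]
```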

## Twython
From its [Github](https://github.com/ryanmcgrath/twython):
>Twython is the premier Python library providing an easy (and up-to-date) way to access Twitter data. Actively maintained and featuring support for Python 2.6+ and Python 3. It's been battle tested by companies, educational institutions and individuals alike. 

To install Twython, simply open a Terminal and input: 

\> pip install twython

The following examples will only provide a basic introduction to the steps necessary for mining Twitter data. For further reading, there are also a number of tutorials [here](http://nodotcom.org/python-twitter-tutorial.html), [here](http://socialmedia-class.org/twittertutorial.html), and [here](http://adilmoujahid.com/posts/2014/07/twitter-analytics/). 

### Initializing Twitter Instance

In [6]:
import numpy as np
from twython import Twython

## Load and store authentication keys.
oath = np.load('twitter_oath.npz')
APP_KEY = oath['consumer_key'].astype(str).tolist()
APP_SECRET = oath['consumer_secret'].astype(str).tolist()

## Initialize twitter object to receive ACCESS_TOKEN.
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)

## Store and save ACCESS_TOKEN.
ACCESS_TOKEN = twitter.obtain_access_token()

## Re-initialize Twitter object with ACCESS_TOKEN to use it.
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)

### Search Querying

In [7]:
import urllib.parse

## Search for Tweets using the hashtag python,
## with positive connotation.
query = '#python :)'

## URL-encode query.
query = urllib.parse.quote_plus(query)

## Make function call.
results = twitter.search(q=query, lang='en')
type(results)

dict

Query results are stored in a dictionary format. At the top level, there are two keys:
* **statuses:** a list of dictionaries containing the results of the search.
* **search_metadata:** a dictionary containing information about the query itself, including the max_id and since_id

Each dictionary in "statuses" contains a large amount of information, including: 
* **Tweet ID:** id
* **Tweet creation date:** created_at
* **Tweet text:** text
* **Retweet count:** retweet_count
* **Favorite count:** favorite_count
* **Coordinates:** coordinates (if available).

Each dictionary in "statuses" also contains several additional keys that store further dictionaries of information, including:
* **User information:** user
* **Tweet metadata:** metadata
* **Content of retweeted status:** retweeted_status (if applicable)
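Because some values are themselves dictionaries, it is often convenient to reduce each status to a flat record before further processing. The sketch below shows one way to do this; the function `flatten_status` and the hand-made status are hypothetical, and only keys named in the lists above are used.

```python
def flatten_status(status):
    """Keep a few top-level fields and pull selected values out of 'user'."""
    user = status.get('user') or {}
    return {
        'id': status.get('id'),
        'created_at': status.get('created_at'),
        'text': status.get('text'),
        'retweet_count': status.get('retweet_count'),
        'user_id': user.get('id'),
        ## 'retweeted_status' is only present when the Tweet is a retweet.
        'is_retweet': 'retweeted_status' in status,
    }

## Hand-made status (hypothetical values) with the keys described above.
status = {'id': 1, 'created_at': 'Sat Jun 10 03:45:39 +0000 2017',
          'text': 'RT @someone: hello', 'retweet_count': 1,
          'user': {'id': 42, 'name': 'example'},
          'retweeted_status': {'id': 0}}
record = flatten_status(status)
print(record)
```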

### Aggregating Search Results
The list of dictionaries can very easily be amalgamated into a single Pandas DataFrame.

In [8]:
from pandas import Series, DataFrame

## Convert each status dictionary into a Pandas Series.
statuses = [Series(status) for status in results['statuses']]

## Merge into single DataFrame
df = DataFrame(statuses)

## Drop unnecessary columns.
df = df.drop(columns=['entities', 'extended_entities', 'id_str', 'metadata', 'retweeted_status', 'source'])

df.head(5)

Unnamed: 0,contributors,coordinates,created_at,favorite_count,favorited,geo,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,...,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,retweet_count,retweeted,text,truncated,user
0,,,Sat Jun 10 03:45:39 +0000 2017,0,False,,873385646454386692,,,,...,,False,en,,,1,False,RT @craigbrownphd: #MachineLearning #Spark #Py...,False,"{'id': 619148364, 'id_str': '619148364', 'name..."
1,,,Sat Jun 10 03:45:12 +0000 2017,0,False,,873385531266277377,,,,...,,False,en,,False,0,False,Get started as a #Maker by signing up for our ...,True,"{'id': 3925032559, 'id_str': '3925032559', 'na..."
2,,,Sat Jun 10 03:45:08 +0000 2017,0,False,,873385517139865600,,,,...,,False,en,,False,0,False,This Hours Photo: #weather #minnesota #photo #...,False,"{'id': 821038569791819776, 'id_str': '82103856..."
3,,,Sat Jun 10 03:45:03 +0000 2017,0,False,,873385496654880768,,,,...,,False,en,,False,11,False,RT @DataCamp: Infographic: switching from #web...,False,"{'id': 4619684853, 'id_str': '4619684853', 'na..."
4,,,Sat Jun 10 03:45:03 +0000 2017,0,False,,873385495367229441,,,,...,,False,en,,,0,False,"Trend Crawler: added 4 trending topics to DB, ...",False,"{'id': 855036706159984640, 'id_str': '85503670..."


It is now especially easy to look up and access the content of Tweets.

In [9]:
df.text

0     RT @craigbrownphd: #MachineLearning #Spark #Py...
1     Get started as a #Maker by signing up for our ...
2     This Hours Photo: #weather #minnesota #photo #...
3     RT @DataCamp: Infographic: switching from #web...
4     Trend Crawler: added 4 trending topics to DB, ...
5     @JackieKazil @Podcast__init__ Here's @JackieKa...
6     RT @AdamSmitht1: #python sklearn install error...
7     #python sklearn install error can not be found...
8     RT @gotynker: Let's beat the summer slide! Her...
9     Check out @JackieKazil's inspiring interview o...
10    #MachineLearning #Spark #Python #Java run on #...
11    Python List Comprehension Vs. Map #python #lis...
12    RT @Hakin9: Face Recognition With Python, in U...
13    What is the standard Python docstring format? ...
14    Why use pip over easy_install? #python #pip #s...
Name: text, dtype: object

As can be observed, several of the secondary dictionaries (e.g. entities, metadata, user) are stored as such in their respective columns. These can be deleted, or new functions can be written to extract relevant information.

In [10]:
## Define new user extract function.
extract_user_info = np.vectorize( lambda d: d['id'] if isinstance(d,dict) else np.nan )

## Apply.
df.user = extract_user_info(df.user)
df.head(5)

Unnamed: 0,contributors,coordinates,created_at,favorite_count,favorited,geo,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,...,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,retweet_count,retweeted,text,truncated,user
0,,,Sat Jun 10 03:45:39 +0000 2017,0,False,,873385646454386692,,,,...,,False,en,,,1,False,RT @craigbrownphd: #MachineLearning #Spark #Py...,False,619148364
1,,,Sat Jun 10 03:45:12 +0000 2017,0,False,,873385531266277377,,,,...,,False,en,,False,0,False,Get started as a #Maker by signing up for our ...,True,3925032559
2,,,Sat Jun 10 03:45:08 +0000 2017,0,False,,873385517139865600,,,,...,,False,en,,False,0,False,This Hours Photo: #weather #minnesota #photo #...,False,821038569791819776
3,,,Sat Jun 10 03:45:03 +0000 2017,0,False,,873385496654880768,,,,...,,False,en,,False,11,False,RT @DataCamp: Infographic: switching from #web...,False,4619684853
4,,,Sat Jun 10 03:45:03 +0000 2017,0,False,,873385495367229441,,,,...,,False,en,,,0,False,"Trend Crawler: added 4 trending topics to DB, ...",False,855036706159984640


### Putting it all together
Twitter [rate limits](https://dev.twitter.com/rest/public/rate-limiting) API queries, meaning that we need to be aware of how often we are querying. According to [this table](https://dev.twitter.com/rest/public/rate-limits), it appears the OAuth 2 search function is limited to 450 requests per 15 minute window. In other words, we are limited to one request every 2 seconds. 
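The arithmetic behind that figure is simply the window length divided by the number of allowed requests. A small sketch follows; the helper name `min_interval` is hypothetical.

```python
def min_interval(max_requests, window_seconds):
    """Seconds to wait between evenly spaced requests to stay within a rate limit."""
    return window_seconds / max_requests

## Twitter search: 450 requests per 15-minute window.
twitter_wait = min_interval(450, 15 * 60)
## Facebook (quoted earlier in this document): 200 requests per hour.
facebook_wait = min_interval(200, 60 * 60)
print(twitter_wait, facebook_wait)
```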

Below, we make use of the code previously written and the time library to make 10 search queries, then merge and store the results while obeying the rate limit.

In [11]:
import time, urllib.parse
import numpy as np
from twython import Twython
from pandas import Series, DataFrame, concat

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Authentication.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Load and store authentication keys.
oath = np.load('twitter_oath.npz')
APP_KEY = oath['consumer_key'].astype(str).tolist()
APP_SECRET = oath['consumer_secret'].astype(str).tolist()

## Initialize twitter object to receive ACCESS_TOKEN.
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)

## Store and save ACCESS_TOKEN.
ACCESS_TOKEN = twitter.obtain_access_token()

## Re-initialize Twitter object with ACCESS_TOKEN to use it.
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Search for Tweets from Columbia University
## filtering out any retweets.
query = 'from:columbia -filter:retweets'

## URL-encode query.
query = urllib.parse.quote_plus(query)

max_id = None
df = DataFrame()
for _ in range(10):
    
    ## Make function call. After the first query, only request
    ## Tweets older than the oldest Tweet already collected.
    if max_id is None: results = twitter.search(q=query, lang='en')
    else: results = twitter.search(q=query, lang='en', max_id=max_id-1)
        
    ## Stop if no further Tweets are returned.
    if not results['statuses']: break
        
    ## Convert each status dictionary into a Pandas Series.
    statuses = [Series(status) for status in results['statuses']]
    
    ## Merge.
    df = concat([df, DataFrame(statuses)])
        
    ## Extract the lowest (i.e. oldest) Tweet ID of this batch.
    max_id = min(status['id'] for status in results['statuses'])
        
    ## Sleep timer.
    time.sleep(2)

## Display results.
print(df.shape)
df.head(3)

(150, 27)


Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,id_str,...,metadata,place,possibly_sensitive,retweet_count,retweeted,retweeted_status,source,text,truncated,user
0,,,Sat Jun 10 03:44:29 +0000 2017,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 870437204333559811, 'id_str'...",0,False,,873385351292870657,873385351292870657,...,"{'iso_language_code': 'en', 'result_type': 're...",,False,163,False,{'created_at': 'Fri Jun 02 00:29:37 +0000 2017...,"<a href=""http://twitter.com/download/iphone"" r...",RT @MadamMelanin: Kali Uchis | This singer fro...,False,"{'id': 1329443965, 'id_str': '1329443965', 'na..."
1,,,Sat Jun 10 03:43:30 +0000 2017,"{'hashtags': [], 'symbols': [], 'user_mentions...",,0,False,,873385106643320836,873385106643320836,...,"{'iso_language_code': 'en', 'result_type': 're...",,False,0,False,,"<a href=""http://publicize.wp.com/"" rel=""nofoll...",2 unaccounted for after aircraft goes missing ...,False,"{'id': 795024244606283776, 'id_str': '79502424..."
2,,,Sat Jun 10 03:42:22 +0000 2017,"{'hashtags': [], 'symbols': [], 'user_mentions...",,0,False,,873384819970977793,873384819970977793,...,"{'iso_language_code': 'en', 'result_type': 're...",,,6860,False,{'created_at': 'Thu Jun 08 16:16:27 +0000 2017...,"<a href=""http://twitter.com/download/iphone"" r...",RT @AlanDersh: Senators should ask Comey the n...,False,"{'id': 3540150733, 'id_str': '3540150733', 'na..."


# Reddit API
To use Reddit's API, you will first need to [register](https://github.com/reddit/reddit/wiki/OAuth2) for an application with Reddit. Once there, create a new app and fill out the requisite details (e.g. name, description, etc.).

After registering, you will need your **Client ID** and **Client Secret**, which will be used to authorize your API calls. Be sure to store them securely. In addition, you will need to define a **user agent** string that follows the naming conventions outlined [here](https://github.com/reddit/reddit/wiki/API#rules).

The complete [Reddit API](https://www.reddit.com/dev/api/) is a helpful resource, but is predominantly full of information on user-authenticated services. We will largely focus on OAuth 2 operations, i.e. searching through publicly accessible Reddit content.

## PRAW
The Python Reddit API Wrapper, or [PRAW](https://github.com/praw-dev/praw), is:
>... a python package that allows for simple access to Reddit's API. PRAW aims to be easy to use and internally follows all of Reddit's API rules. With PRAW there's no need to introduce sleep calls in your code. Give your client an appropriate user agent and you're set.

To install PRAW, simply open a terminal and input:

\> pip install praw

For further reading, there are also a number of tutorials [here](https://praw.readthedocs.io/en/latest/tutorials/comments.html), [here](http://pythonforengineers.com/build-a-reddit-bot-part-1/), and [here](http://minimaxir.com/2015/10/reddit-bigquery/).


### Initializing Reddit Instance

In [12]:
import numpy as np
from praw import Reddit

## Load and store authentication keys.
oath = np.load('reddit_oath.npz')
CLIENT_ID = oath['client_id'].astype(str).tolist()
CLIENT_SECRET = oath['client_secret'].astype(str).tolist()
USER_AGENT = oath['user_agent'].astype(str).tolist()

## Initialize Reddit object to receive ACCESS_TOKEN.
reddit = Reddit(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, user_agent=USER_AGENT)
reddit

<praw.reddit.Reddit at 0x7fa6c57955c0>

### Search Querying 
To understand querying for Reddit, one first has to know a little about the organization of Reddit itself. For the uninitiated, Reddit is a forum website composed of multiple specialty forums, known as **subreddits**. An individual subreddit is itself made up of **submissions**, or user posts to the subreddit. Submissions are made of the **body**, or content, of the post and **comments** left in response to the post. **Users** generate submissions and comments, and are also members of different subreddits.

Each of these individual instances (e.g. subreddits, submissions, comments, users) is queryable. The pseudo-hierarchical structure of Reddit, however, means that querying a higher-order class will usually return lower-level classes (e.g. querying a subreddit will return submissions, querying a submission will return comments). Reflecting the structure of Reddit, we will cover each type of query in a top-down fashion.

### Querying Subreddits
To search over the space of subreddits (i.e. over all forums), there are two main functions: **subreddits** and **random_subreddit**. The former provides different tools for finding subreddits, including providing a list of default and popular subreddits; the latter randomly selects and returns an instance of one subreddit. If performing search, Reddit's search syntax can be found [here](https://www.reddit.com/wiki/search). Below are examples of both function types.

In [13]:
## Search for subreddits involving avocados.
avocado_subreddits = reddit.subreddits.search('avocado')
avocado_subreddits = list(avocado_subreddits)
avocado_subreddits[:5]

[Subreddit(display_name='avocadosgonewild'),
 Subreddit(display_name='avocado'),
 Subreddit(display_name='avocados'),
 Subreddit(display_name='enlightenedavocadomen'),
 Subreddit(display_name='unexpectedavocado')]

In [14]:
## Return random subreddit.
random_subreddit = reddit.random_subreddit()
random_subreddit

Subreddit(display_name='occupywallstreet')

### Querying Submissions
After identifying a subreddit of interest, its submissions can be queried directly from the subreddit instance using the **subreddit.submissions** method.

Similar to querying for subreddits, querying the submissions of a subreddit will return a generator instance, which can be traversed to return the submissions to that subreddit. The submissions method accepts three parameters:
* **start:** A UNIX timestamp indicating the earliest creation time of a submission yielded during the call.
* **end:** A UNIX timestamp indicating the latest creation time of a submission yielded during the call.
* **extra_query:** A cloudsearch query that will be used to further filter results.

We demonstrate with the r/learnpython subreddit.

In [15]:
import datetime, time

## Initialize subreddit instance.
subreddit = reddit.subreddit('learnpython')

## Define start and end dates 
start = '2017-06-01'
start = time.mktime(datetime.datetime.strptime(start, '%Y-%m-%d').timetuple()) # Convert to UNIX time.
end = '2017-06-02'
end = time.mktime(datetime.datetime.strptime(end, '%Y-%m-%d').timetuple()) # Convert to UNIX time.

## Query submissions.
submissions = subreddit.submissions(start=start, end=end)
submissions = list(submissions)
submissions[:5]

[Submission(id='6es1xr'),
 Submission(id='6erq6p'),
 Submission(id='6ermnb'),
 Submission(id='6er5ub'),
 Submission(id='6er3b9')]
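
One caveat about the date conversion above: `time.mktime` interprets its input as local time, so the resulting timestamps depend on the timezone of the machine running the code. The following is a minimal sketch of a timezone-independent alternative, assuming the dates are meant as midnight UTC. The helper name `to_unix_utc` is introduced here for illustration only.

```python
from datetime import datetime, timezone

def to_unix_utc(date_string, fmt='%Y-%m-%d'):
    """Convert a date string to a UNIX timestamp, treating it as midnight UTC."""
    parsed = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()

start = to_unix_utc('2017-06-01')
end = to_unix_utc('2017-06-02')
print(start, end)
```

The two timestamps are exactly one day (86,400 seconds) apart regardless of where the code is run.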

### Querying Bodies & Comments
Once a submission has been identified, its content can be queried. Note from above that submissions are identified by their respective **IDs**. 

Each submission stores many pieces of data, including its author, title, subreddit, and upvotes/downvotes/score. There is also the body of the submission, stored as its **selftext**. Finally, the comments on the submission can also be investigated. Comments have similar attributes (e.g. author and score), with their text stored as **body**.

In [16]:
## Take the second submission (index 1).
submission = submissions[1]

## Print metadata.
print('Author: %s' %submission.author)
print('Title: %s' %submission.title)
print('Score: %s' %submission.score)
print('Body: %s' %submission.selftext)

Author: trexmixx
Title: Running a python file continuously (best practice)
Score: 46
Body: Somewhat new to practical python.  I'm working on making a twitter bot- whats the best practice for keeping a script running continuously?  I've looked into hosting it on a free web server online and getting a raspberry pi and keeping it on - are either of these actually legitimate?  I'm out of my depth.


In [17]:
## Extract comments.
comments = submission.comments
comments = list(comments)

In [18]:
## Take first comment.
comment = comments[0]

## Print metadata.
print('Author: %s' %comment.author)
print('Score: %s' %comment.score)
print('Body: %s' %comment.body)

Author: elbiot
Score: 21
Body: Either of those are fine options. Write a systemd service, since that's how continually running services are usually done on Linux (now that systemd is in almost all of the major distros).


### Putting it all together
Due to the hierarchical nature of Reddit, data organization/storage is a non-trivial design problem and should reflect the ultimate analytic goals. For example, if one is interested only in text mining the submissions of a subreddit without caring about the structure of submission-comment and comment-comment relationships, then a flat structure may be appropriate. If more complex tree-like relations are desired, then nested data structures (e.g. nested file directories, JSON, XML, etc.) may be necessary.

In any event, an example script is given below that mines submissions & comments from r/Anxiety, as part of a larger project of predicting symptom clusters from text. It should also be noted that PRAW handles Reddit's [rate limit](https://github.com/reddit/reddit/wiki/API#rules) of 60 requests per minute (1 request/sec; [source](https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html)).
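
To make the flat-versus-nested trade-off concrete, here is a minimal sketch that flattens a nested thread into flat records while keeping a pointer to each record's parent, so the tree can be rebuilt later if needed. The `thread` dictionary and its keys (`id`, `text`, `replies`) are hypothetical illustration data, not the structure PRAW returns.

```python
def flatten_thread(node, parent_id=None, depth=0):
    """Yield one flat record per post, keeping a pointer to its parent."""
    yield {'id': node['id'], 'parent_id': parent_id, 'depth': depth, 'text': node['text']}
    for reply in node.get('replies', []):
        yield from flatten_thread(reply, parent_id=node['id'], depth=depth + 1)

# Hypothetical nested thread: one submission, two top-level comments, one reply.
thread = {
    'id': 's1', 'text': 'Submission body',
    'replies': [
        {'id': 'c1', 'text': 'Top-level comment',
         'replies': [{'id': 'c2', 'text': 'Reply to c1'}]},
        {'id': 'c3', 'text': 'Another top-level comment'},
    ],
}

rows = list(flatten_thread(thread))
for row in rows:
    print(row)
```

Storing `parent_id` and `depth` alongside the text keeps the convenience of a flat table without discarding the reply structure.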




In [19]:
import numpy as np
import datetime, time
from praw import Reddit
from pandas import Series, DataFrame

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Authentication.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Load and store authentication keys.
oath = np.load('reddit_oath.npz')
CLIENT_ID = oath['client_id'].astype(str).tolist()
CLIENT_SECRET = oath['client_secret'].astype(str).tolist()
USER_AGENT = oath['user_agent'].astype(str).tolist()

## Initialize Reddit object to receive ACCESS_TOKEN.
reddit = Reddit(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, user_agent=USER_AGENT)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Initialize posts (bodies + comments).
posts = []

## Define start and end dates 
start = '2017-06-01'
start = time.mktime(datetime.datetime.strptime(start, '%Y-%m-%d').timetuple()) # Convert to UNIX time.
end = '2017-06-02'
end = time.mktime(datetime.datetime.strptime(end, '%Y-%m-%d').timetuple()) # Convert to UNIX time.

## Load subreddit.
subreddit = reddit.subreddit('Anxiety')

for submission in subreddit.submissions(start=start, end=end):
    
    ## Parse submission.
    series = Series([submission.author, submission.created_utc, submission.score, submission.selftext],
                    index=['User','Date','Score','Text'])
    posts.append(series)
    
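    ## Iterate over top-level comments. First drop any MoreComments
    ## placeholders, which lack author/score/body attributes.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments:
        
        ## Parse comment.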
        series = Series([comment.author, comment.created_utc, comment.score, comment.body],
                        index=['User','Date','Score','Text'])
        posts.append(series)
        

## Merge.
posts = DataFrame(posts)
posts.head(5)

| | User | Date | Score | Text |
|---|---|---|---|---|
| 0 | iwabo1234 | 1496376000.0 | 2 | I'm getting my wisdom teeth removed tomorrow (... |
| 1 | winning34 | 1496382000.0 | 2 | Dental anxiety is a very real thing, but thank... |
| 2 | kajjaznam1 | 1496408000.0 | 2 | I took one out 2 days ago, it didn't hurt beca... |
| 3 | Barbar21 | 1496375000.0 | 7 | So long story short i think i have anxiety iss... |
| 4 | solidpancake | 1496383000.0 | 5 | Once I heard it described as that feeling you ... |


# Other Python API Wrappers
The [Real Python](https://realpython.com/) group has put together an extensive [list of Python API wrappers](https://github.com/realpython/list-of-python-api-wrappers), which includes Amazon Shopping, Amazon Web Services, Craigslist, Dropbox, Ebay, Evernote, FedEx, Flickr, Forecast, Geopy, Github, Google Maps, Google Music, Indeed, Instagram, Linkedin, Medium, NASA, Netflix, RottenTomatoes, Slack, Spotify, StackExchange, Uber, World Bank, Yahoo, and YouTube. See the list for full details including links.

Outside of the list, several other Python API wrappers encountered while writing this tutorial include the [Federal Election Commission](https://github.com/jeremyjbowers/pyopenfec), [Elsevier](https://github.com/ElsevierDev/elsapy), [Google Scholar](https://github.com/ckreibich/scholar.py), [IMDB](http://imdbpy.sourceforge.net/), and [Last.fm](https://github.com/pylast/pylast).

# Web Scraping

When API access to online content is not available, webscraping is a viable alternative. There are many packages and tools for webscraping with Python; at the time of writing, the two dominant packages are [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) and [Scrapy](https://scrapy.org/).

**Scrapy** is a newer package that handles the webcrawling aspect of webscraping; it was written to easily and efficiently crawl many webpages and store content for later parsing. **BeautifulSoup** is a library designed to parse HTML and XML documents. A more complete treatment of webcrawling and web markup languages is beyond the scope of the current tutorial, but a working example of BeautifulSoup is given below. For more detailed introductions to Scrapy and BeautifulSoup, please see their [respective](https://docs.scrapy.org/en/latest/index.html) [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) as well as these tutorials on [webcrawling](https://docs.scrapy.org/en/latest/intro/tutorial.html#) and [webscraping](https://www.dataquest.io/blog/web-scraping-tutorial-python/).

## BeautifulSoup
In this example, we will use BeautifulSoup to webscrape the table of [complete music ratings](http://www.albumoftheyear.org/ratings/overall/2017/) from the website, [*Album of the Year*](http://www.albumoftheyear.org/).

### Requesting and Parsing HTML
We will use the **requests** package to request and store the website information and use **BeautifulSoup** to parse the resulting HTML. (Note that Scrapy would replace the requests package in a fuller treatment of the subject.)

In [20]:
import requests
from bs4 import BeautifulSoup

## Define URL for scraping.
url = 'http://www.albumoftheyear.org/ratings/overall/2017/'

## Make request.
request = requests.get(url)

## Define a new BeautifulSoup object from the HTML 
## of the requested website, and specifying lxml
## as the background library for parsing.
soup = BeautifulSoup(request.text, 'lxml')

## Call the prettify function to print the HTML in a
## human readable format.
print(soup.prettify()[:1000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
 <head>
  <title>
   Complete Music Ratings of 2017
  </title>
  <link href="/main.css" rel="stylesheet" type="text/css"/>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <meta content="http://www.albumoftheyear.org/images/fbLogo.jpg" property="og:image"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700" rel="stylesheet"/>
  <link href="http://cdn.albumoftheyear.org/images/favicon.png" rel="icon" type="image/png"/>
  <script data-cfasync="false" src="https://s3.amazonaws.com/pubfig/albumoftheyear/pubfig1.min.js">
  </script>
 </head>
 <body id="graybg">
  <noscript>
   <iframe height="0" src="//www.googletagmanager.com/ns.html?id=GTM-TPD5DD" style="display:none;visibility:hidden" width="0">
   </iframe>
  </noscript>
  <script>
   (function(w,d

Printed above are the first 1000 characters of the HTML making up the [2017 table of complete music rankings](http://www.albumoftheyear.org/ratings/overall/2017/) from AOTY. HTML, or the HyperText Markup Language, is a markup language that tells a browser how to lay out content. HTML specifies all of the content of a webpage, as well as its formatting.

HTML is composed of tags, which are enclosed in angle brackets (`<` and `>`); most come in opening and closing pairs, such as `<title>` and `</title>`. There are many different types of tags and they denote different usages (e.g. `<p>` is the paragraph tag, `<div>` is the division tag, `<a>` is the hyperlink tag). Tags may carry attributes (e.g. `href` or `id`), which are properties of the tag that modify its function. Finally, HTML has a hierarchical format; in other words, tags are nested within other tags. As can be observed above, the `<title>` tag is nested under the `<head>` tag, and the `<head>` tag is nested under the outermost `<html>` tag.
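
To illustrate the nesting and attributes just described, the sketch below walks a small HTML string and records each opening tag with its depth in the hierarchy. It uses only the standard library's `html.parser` so that it runs without a network connection or lxml. It is a stand-in for illustration, not a description of how BeautifulSoup works internally. The `TagTreePrinter` class and the abbreviated `html` string are introduced here as assumptions.

```python
from html.parser import HTMLParser

class TagTreePrinter(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

# Abbreviated, hand-written HTML modeled on the page printed above.
html = ('<html><head><title>Complete Music Ratings of 2017</title></head>'
        '<body id="graybg"><a href="/recommendations/">Recommendations</a></body></html>')

parser = TagTreePrinter()
parser.feed(html)
for depth, tag, attrs in parser.seen:
    print('  ' * depth + tag, attrs)
```

The indentation of the printed tags mirrors the hierarchy: `title` sits under `head`, which sits under `html`.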

The utility of BeautifulSoup is in parsing these nested hierarchical structures. A BeautifulSoup instance provides many functions for searching along and retrieving data from trees of tags. For example, tag types can be called directly from the soup instance. These will return the first tag matching the specified pattern.

In [21]:
## Get the first instance of two tags.
print(soup.a)
print(soup.br)

<a href="/recommendations/">Recommendations</a>
<br clear="all"/>


All of the tags of a particular type can also be queried and returned in a list.

In [22]:
## Find all a-tags, i.e. hyperlinks.
tags = soup.find_all('a')
tags[:5]

[<a href="/recommendations/">Recommendations</a>,
 <a href="/add-album.php" rel="nofollow">Add Album</a>,
 <a href="/account/" rel="nofollow">Sign Up / Login</a>,
 <a href="http://www.facebook.com/albumoftheyear" rel="nofollow" target="_blank"><div class="facebookHead"></div></a>,
 <a href="http://www.twitter.com/aoty" rel="nofollow" target="_blank"><div class="twitterHead"></div></a>]

Individual tags have attributes that allow for further information retrieval, including the lookup of tags nested underneath them.

In [23]:
## Extract last tag from list.
tag = tags[-1]

## Print data from tag.
print('Tag attrs: %s' %tag.attrs)
print('Tag href: %s' %tag['href'])
print('Tag contents: %s' %tag.contents)
print('Tag text: %s' %tag.text)

Tag attrs: {'href': 'http://forums.albumoftheyear.org/c/site-feedback'}
Tag href: http://forums.albumoftheyear.org/c/site-feedback
Tag contents: ['Feedback']
Tag text: Feedback


Some study of the HTML structure reveals that each row of the [ratings table](http://www.albumoftheyear.org/ratings/overall/2017/) is denoted by the `<tr>` tag. We will use this information and the *descendants* attribute (which iterates through the children, grandchildren, great-grandchildren, etc. of the tag) to extract the content of each row of the table.

In [24]:
rows = []
for link in soup.find_all('tr'):
    rows.append( [descendant for descendant in link.descendants if isinstance(descendant,str)] )

for row in rows[:5]:
    print(row)

['AoTY', 'Artist / Album', 'A.V.', 'AMG', 'CoS', 'DIY', 'DiS', 'OMH', 'NME', 'NoRip', 'Paste', 'P4K', 'PM', 'PMA', 'RS', 'Spin', '405', 'TLOBF', 'Skinny', 'TMT', 'Radar']
['1', '92', 'Kendrick Lamar', 'DAMN.', '100', '90', '91', '100', '90', 'n/a', '80', 'n/a', '91', '92', '90', '100', '90', 'n/a', '95', 'n/a', 'n/a', '100', '80']
['2', '90', 'Mount Eerie', 'A Crow Looked at Me', '91', '80', '91', 'n/a', 'n/a', 'n/a', 'n/a', '90', '92', '90', '100', '83', 'n/a', 'n/a', '90', 'n/a', 'n/a', '100', '80']
['3', '86', 'Arca', 'Arca', '91', '80', '83', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', '85', '90', '83', 'n/a', 'n/a', '75', '85', 'n/a', '80', 'n/a']
['4', '86', 'Valerie June', 'The Order of Time', 'n/a', '80', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', '90', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a']


### Aggregating Search Results
We can take the scraped table and store it in a Pandas DataFrame.

In [25]:
from pandas import DataFrame

## Define colnames (with a bit of hacking).
colnames = ['Rank'] + [col for sublist in rows[0] for col in sublist.replace(' ','').split('/')] 

## Store contents as DataFrame, skipping over rows
## identical to the first row.
df = DataFrame([row for row in rows if not row == rows[0]], columns=colnames)

df.head(5)

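| | Rank | AoTY | Artist | Album | A.V. | AMG | CoS | DIY | DiS | OMH | ... | P4K | PM | PMA | RS | Spin | 405 | TLOBF | Skinny | TMT | Radar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 92 | Kendrick Lamar | DAMN. | 100.0 | 90 | 91.0 | 100.0 | 90.0 | | ... | 92.0 | 90 | 100.0 | 90.0 | | 95.0 | | | 100.0 | 80.0 |
| 1 | 2 | 90 | Mount Eerie | A Crow Looked at Me | 91.0 | 80 | 91.0 | | | | ... | 90.0 | 100 | 83.0 | | | 90.0 | | | 100.0 | 80.0 |
| 2 | 3 | 86 | Arca | Arca | 91.0 | 80 | 83.0 | | | | ... | 85.0 | 90 | 83.0 | | | 75.0 | 85.0 | | 80.0 | |
| 3 | 4 | 86 | Valerie June | The Order of Time | | 80 | | | | | ... | | 90 | | | | | | | | |
| 4 | 5 | 86 | Jlin | Black Origami | | 90 | 91.0 | | | | ... | 88.0 | 80 | | | | 80.0 | | | 80.0 | |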
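
The column-name "hack" above is easier to follow on a shortened header. The scraped header row has 21 entries while each data row has 23: the rank column has no header text, and `'Artist / Album'` covers two data columns. The sketch below applies the same comprehension to an abbreviated header. The list `header` is a truncated copy of the first row printed earlier, used here for illustration.

```python
# Abbreviated copy of the scraped header row.
header = ['AoTY', 'Artist / Album', 'A.V.', 'AMG']

# Remove spaces, then split on '/', so 'Artist / Album' yields two names.
split_names = [name for cell in header for name in cell.replace(' ', '').split('/')]

# Prepend a name for the rank column, which has no header text on the page.
colnames = ['Rank'] + split_names
print(colnames)
```

Cells without a slash pass through unchanged, since `split('/')` returns a one-element list for them.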


### Putting it all together
All of this code can be assembled into one script that will act as a simple webcrawler and webscraper for several years of AOTY tables (2013 through 2016 in the example below).

In [26]:
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame, concat

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Define parameters.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Define URL base (notice the string substitution).
url = 'http://www.albumoftheyear.org/ratings/overall/%s/'

## Define years (2013 through 2016; range excludes its endpoint).
years = range(2013,2017)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

tables = []
for year in years:
    
    ## Make request.
    request = requests.get(url %year)

    ## Define a new BeautifulSoup object.
    soup = BeautifulSoup(request.text, 'lxml')

    ## Extract all rows from the table.
    rows = []
    for link in soup.find_all('tr'):
        rows.append( [descendant for descendant in link.descendants if isinstance(descendant,str)] )
        
    ## Define colnames (with a bit of hacking).
    colnames = ['Rank'] + [col for sublist in rows[0] for col in sublist.replace(' ','').split('/')] 

    ## Store contents as DataFrame, skipping over rows
    ## identical to the first row.
    df = DataFrame([row for row in rows if not row == rows[0]], columns=colnames)
    
    ## Define year variable.
    df.insert(0, 'Year', year)
    
    ## Store.
    tables.append(df)
    
## Concatenate and sort by AOTY score.
tables = concat(tables)
tables = tables.sort_values(['AoTY'], ascending=False)
tables.head(5)

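| | Year | Rank | AoTY | Artist | Album | A.V. | AMG | CoS | DIY | DiS | ... | P4K | PM | PMA | RS | Spin | 405 | TLOBF | Skinny | TMT | Radar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1 | 94 | Kendrick Lamar | To Pimp a Butterfly | 91.0 | 100.0 | 100.0 | 80.0 | 100.0 | ... | 93.0 | 90.0 | 100.0 | 90.0 | 100 | 100.0 | 90.0 | | 100.0 | 90.0 |
| 0 | 2014 | 1 | 93 | D'Angelo | Black Messiah | 100.0 | 90.0 | 91.0 | | 90.0 | ... | 94.0 | 90.0 | 100.0 | 90.0 | 90 | 90.0 | 90.0 | | 100.0 | |
| 0 | 2013 | 1 | 91 | Deafheaven | Sunbather | 100.0 | 90.0 | 90.0 | | | ... | 89.0 | 90.0 | | 60.0 | 80 | 90.0 | | 100.0 | | |
| 1 | 2015 | 2 | 90 | Sufjan Stevens | Carrie & Lowell | 100.0 | 80.0 | 100.0 | 80.0 | 90.0 | ... | 93.0 | 60.0 | 100.0 | 80.0 | 80 | 95.0 | 85.0 | 80.0 | 100.0 | 85.0 |
| 2 | 2015 | 3 | 90 | Mbongwana Star | From Kinshasa | | | | | 100.0 | ... | | | | | 90 | | 85.0 | | | |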
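
One point worth checking before relying on the sort: the scraped cells are strings, and sorting strings compares them character by character. That happens to agree with numeric order when every score has two digits, but it would misplace a score such as `'100'`. The following is a minimal sketch of converting cells to numbers first. The helper `to_score` and the sample list `raw_scores` are introduced here for illustration.

```python
def to_score(cell):
    """Convert a scraped cell to float, mapping non-numeric text such as 'n/a' to None."""
    try:
        return float(cell)
    except ValueError:
        return None

# Hypothetical sample of scraped cells.
raw_scores = ['94', '100', '9', 'n/a']

# Sorting the text values: '9' is placed ahead of '100'.
text_sorted = sorted(raw_scores[:3], reverse=True)
print(text_sorted)

# Converting first gives the intended numeric order.
numeric = [to_score(cell) for cell in raw_scores]
numeric_sorted = sorted((score for score in numeric if score is not None), reverse=True)
print(numeric_sorted)
```

The same idea would apply to a DataFrame column by converting it to a numeric type before sorting.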
