# <center>Web Scraping by API </center>

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- You need to read documentation of APIs to figure out how to access data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses HTTP requests to GET, PUT, POST and DELETE data
- Experiment:
    - Get an API key here: http://www.omdbapi.com/apikey.aspx and follow the instructions to activate the key
    - Use API, e.g. **http://www.omdbapi.com/<font color="blue"><b>?</b></font>t=Rogue+One<font color="blue"><b>&</b></font>plot=full<font color="blue"><b>&</b></font>r=json<font color="blue"><b>&</b></font>apikey={your api key}**, where
        - **t=Rogue+One**: specify the movie title 
        - **plot=full**: return full plot
        - **r=json**: result is in json format
        - **apikey**: use your api key 
    - Note the format of URL:
        - API endpoint: http://www.omdbapi.com/ 
        - parameters appear in the URL after the question mark (<font color="blue"><b>?</b></font>) after the endpoint
        - all parameters are concatenated by <font color="blue"><b>"&"</b></font>  
    - You can paste the above URL directly into your browser
    - Or issue API calls using the requests package
- You need to read API documentation to understand how to specify parameters
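
The URL format described above (endpoint, then "?", then parameters joined by "&") can be reproduced with the standard library. This is a minimal sketch; the key value `YOUR_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "http://www.omdbapi.com/"

# parameter names follow the example URL above; YOUR_KEY is a placeholder
params = {"t": "Rogue One", "plot": "full", "r": "json", "apikey": "YOUR_KEY"}

# urlencode joins name=value pairs with "&" and encodes the space as "+"
url = endpoint + "?" + urlencode(params)
print(url)

# going the other way: split the URL and decode the query string
decoded = parse_qs(urlsplit(url).query)
print(decoded["t"])  # ['Rogue One']
```

Note that the title is written with a space in the dictionary; the "+" seen in the URL is only the encoded form of that space.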

In [None]:
# Exercise 2.1. search movies by name

import requests
import json

title='Rogue+One'

# Search API: http://www.omdbapi.com/
# has four parameters: title, full plot, result format, and api_key
# For the get methods, parameters are attached to API URL after a "?"
# Parameters are separated by "&"

# to test, apply for an api key and use the key here
url="http://www.omdbapi.com/?t="+title+\
    "&plot=full&r=json&apikey={your key here}"

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    result = r.json()
    print (json.dumps(result, indent=4))



In [None]:
# Exercise 2.2.  Another way to pass parameters

# requests URL-encodes the values, so write the title with a space, not "+"
parameters = {'t': 'Rogue One', 
              'plot': 'full',
              'r': 'json',
              'apikey': 'your key here'}

r=requests.get('http://www.omdbapi.com/', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))



## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "self-describing" and easy to understand
- the JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON document parsed in Python is typically:
- **a dictionary** (from a JSON object), or 
- a **list of dictionaries** (from a JSON array of objects)
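
As a small illustration of the syntax rules above, the sketch below parses one JSON object and one JSON array and shows the Python types that come back. The movie data is made up for the example.

```python
import json

# curly braces hold an object -> Python dict
obj = json.loads('{"Title": "Rogue One", "Year": "2016"}')

# square brackets hold an array -> Python list (here, a list of dicts)
arr = json.loads('[{"Title": "A"}, {"Title": "B"}]')

print(type(obj).__name__, type(arr).__name__)  # dict list
print(arr[1]["Title"])  # B
```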

### Useful JSON functions
- dumps: serialize a Python object to a JSON string
- dump: serialize a Python object to a file in JSON format
- loads: parse a JSON string into a Python object
- load: parse a JSON file into a Python object
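
The four functions can be tried together in a short round trip. This is a sketch with made-up data; the file is written to a temporary directory so nothing in the working folder is touched.

```python
import json
import os
import tempfile

# sample data, invented for the example
movie = {"Title": "Rogue One", "Year": "2016", "Genres": ["Action", "Sci-Fi"]}

# string round trip: dumps -> loads
text = json.dumps(movie, indent=4)
from_text = json.loads(text)

# file round trip: dump -> load
path = os.path.join(tempfile.mkdtemp(), "movie.json")
with open(path, "w") as f:
    json.dump(movie, f)
with open(path) as f:
    from_file = json.load(f)

print(from_text == movie, from_file == movie)  # True True
```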

In [None]:
# Exercise 3.1 API returns a JSON object 

parameters = {'s': 'harry potter',# search by title
              'plot': 'short', 
              'page': 1,# get 1st page
              'r': 'json', # result format: json/xml
              'apikey': 'your key here'}

r = requests.get('http://www.omdbapi.com/', params=parameters)
if r.status_code==200:
    result = r.json()
    print(result)
    
# you only retrieve the first 10 entries
# how to retrieve all results?
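
One possible answer to the question above is to request page 1, 2, 3, ... until every entry has been collected. The sketch below assumes each response carries a `totalResults` count, a `Search` list and a `Response` flag equal to `"True"` on success; check these field names against the API documentation. `collect_all_pages` and `fake_fetch_page` are hypothetical helpers, and the fake function stands in for the real API call so the example runs offline.

```python
import math

def collect_all_pages(fetch_page, page_size=10):
    """Call fetch_page(1), fetch_page(2), ... and merge the 'Search' lists."""
    first = fetch_page(1)
    if first.get("Response") != "True":
        return []
    movies = list(first.get("Search", []))
    # assumption: totalResults holds the total count (as a string)
    total = int(first.get("totalResults", len(movies)))
    last_page = math.ceil(total / page_size)
    for page in range(2, last_page + 1):
        result = fetch_page(page)
        if result.get("Response") != "True":
            break
        movies.extend(result.get("Search", []))
    return movies

# offline stand-in for the API: 23 made-up titles, 10 per page
def fake_fetch_page(page):
    titles = [f"Movie {i}" for i in range(1, 24)]
    chunk = titles[(page - 1) * 10 : page * 10]
    if not chunk:
        return {"Response": "False", "Error": "Movie not found!"}
    return {"Search": [{"Title": t} for t in chunk],
            "totalResults": str(len(titles)),
            "Response": "True"}

all_movies = collect_all_pages(fake_fetch_page)
print(len(all_movies))  # 23
```

With the real API, `fetch_page` could wrap the `requests.get` call from Exercise 3.1, setting the `page` parameter to the page number and returning `r.json()`.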

In [None]:
# Exercise 3.2. Parse JSON object (a dictionary)

# get a list of dictionaries
movies = result ["Search"]

# convert to string
s = json.dumps(movies, indent=4)
print(s)

# load from a string
movies1 = json.loads(s)
print(movies1)

# save to file (the with block closes the file automatically)
with open("movies.json", "w") as f:
    json.dump(movies, f)

# load from file
with open("movies.json", "r") as f:
    movies1 = json.load(f)
print(movies1)

## 4. Collect Tweets

- Real-time tweets: Tweepy package
  -  Install package
  -  Apply for authentication keys
- Historical tweets 
  - You can always search tweets at https://twitter.com/search and then scrape the results returned
  - Note that there is **no authentication needed**!
  - For reference, check the GitHub project https://github.com/Jefferson-Henrique/GetOldTweets-python 
  - Motivated by this project, let's try the following code

### 4.1 Search historical tweets

In [None]:
# 4.1.1 Scrape past tweets using API 

import requests
from bs4 import BeautifulSoup
from datetime import datetime

# A user agent must be defined in the HTTP request header.
# A user agent is software acting on behalf of a user;
# the header usually identifies the browser in use.
# Some websites reject requests without a user agent.
headers = { 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.'
                              '86 Safari/537.36'}

# specify parameters as a dictionary
payload={"f":"tweets",  # retrieve tweets
         "q":"blockchain since:2017-09-10 until:2017-09-12", # query string
         "max_position":''} # max_position of results (paging purpose)

# send a request with parameters and headers
r=requests.get("https://twitter.com/i/search/timeline",\
              params=payload, headers=headers)

# this is equivalent to typing the following URL into a browser:
# https://twitter.com/search?&q='blockchain since:2017-09-10 until:2017-09-12'
    
if r.status_code==200:
    result=r.json()
    print(result)
    
    # get html source code of tweets
    tweets_html = result['items_html']
    
    # Search returns tweets in descending order of time
    # retrieve the position of the earliest tweets returned
    min_position = result['min_position']

In [None]:
# 4.1.2. define a function to parse tweets html 
# using BeautifulSoup

def getTweets(tweets_html):
    """Parse the tweets HTML and return a list of dicts (user, text, date)."""
    
    result=[]
    
    soup=BeautifulSoup(tweets_html, "html.parser")

    # each tweet sits in its own div
    tweets=soup.select('div.js-stream-tweet')

    for t in tweets:
        username, text, timestamp = '','',''
        
        # user name
        select_user = t.select("span.username.u-dir b")
        if select_user:
            username=select_user[0].get_text()
    
        # tweet text
        select_text = t.select("p.js-tweet-text")
        if select_text:
            text=select_text[0].get_text()
    
        # posting time: Unix epoch seconds converted to local time
        time_select = t.select("small.time span.js-short-timestamp")
        if time_select:
            timestamp=int(time_select[0]["data-time"])
            timestamp=datetime.fromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S")
    
        result.append({"user":username, "text":text, "date":timestamp})
        
    return result
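
The parser converts the `data-time` value with `datetime.fromtimestamp`, which uses the local time zone of the machine running the code, so the same tweet can show different dates on different computers. A minimal sketch of a time-zone-independent conversion, assuming UTC is acceptable; the epoch value is made up for illustration.

```python
from datetime import datetime, timezone

epoch_seconds = 1505088000  # hypothetical data-time value

# passing tz=timezone.utc makes the result independent of the local machine
utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
formatted = utc_time.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2017-09-11 00:00:00
```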

In [None]:
# 4.1.3. Parse tweets using the function

tweets=getTweets(tweets_html)
print("total tweets:", len(tweets))
print(json.dumps(tweets, indent=4))

In [None]:
# 4.1.4. What if we want to return more?

# Note that the search returns tweets 
# in descending order of time.
# Set max_position to 
# the min_position of the last search

payload={"f":"tweets",  # retrieve tweets
         "q":"blockchain since:2017-09-10 until:2017-09-12", # query string
         "max_position":min_position} 

# search again
r=requests.get("https://twitter.com/i/search/timeline",\
              params=payload, headers=headers)

if r.status_code==200:
    result=r.json()
    min_position = result['min_position']
    tweets_html = result['items_html']
    
    tweets=getTweets(tweets_html)
    print("total tweets:", len(tweets))
    print(json.dumps(tweets, indent=4))
    
# You can use a loop to keep sending requests
# until all tweets satisfying your criteria
# have been fetched.
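
The loop mentioned above could look like the sketch below. It only relies on the `items_html` and `min_position` fields used in the previous cells. The stopping rules (empty HTML, a cursor that stops changing, a page cap) are assumptions about reasonable behavior, not documented API guarantees. `fetch_all_pages` and `fake_fetch_page` are hypothetical helpers, and the fake function replaces the HTTP request so the example runs offline.

```python
def fetch_all_pages(fetch_page, max_pages=50):
    """Follow min_position cursors and collect the items_html of each page."""
    pages = []
    position = ""  # an empty max_position asks for the newest tweets
    for _ in range(max_pages):  # cap the number of requests
        result = fetch_page(position)
        items_html = result.get("items_html", "")
        if not items_html.strip():
            break  # nothing more to parse
        pages.append(items_html)
        next_position = result.get("min_position")
        if not next_position or next_position == position:
            break  # cursor did not move; stop to avoid an endless loop
        position = next_position
    return pages

# offline stand-in: two pages of results, then an empty page
FAKE_PAGES = {
    "": {"items_html": "<div>page 1</div>", "min_position": "cursor-a"},
    "cursor-a": {"items_html": "<div>page 2</div>", "min_position": "cursor-b"},
    "cursor-b": {"items_html": "  ", "min_position": "cursor-b"},
}

def fake_fetch_page(max_position):
    return FAKE_PAGES[max_position]

pages = fetch_all_pages(fake_fetch_page)
print(len(pages))  # 2
```

Each collected HTML string can then be passed to `getTweets` and the resulting lists concatenated.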

In [None]:
# 4.1.5. Generate Wordcloud
    
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# concatenate all tweets into one string

text = " ".join([t["text"] for t in tweets])
#print(text)

# Generate a word cloud image
wordcloud = WordCloud(max_font_size=60).generate(text);

# Display the generated image:
plt.figure();
plt.imshow(wordcloud, interpolation="bilinear");
plt.axis("off");
plt.show();
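
A word cloud sizes each word by how often it occurs. To inspect those counts directly, a rough sketch with the standard library is shown below. The sample texts and the tiny stop-word set are made up, and the tokenization is much simpler than what the wordcloud package does internally.

```python
from collections import Counter

# made-up tweet texts for illustration
texts = ["blockchain is the future",
         "bitcoin and blockchain",
         "blockchain hype"]

stopwords = {"is", "the", "and"}  # tiny illustrative stop-word set

# lower-case, split on whitespace, drop stop words
words = [w for t in texts for w in t.lower().split() if w not in stopwords]
counts = Counter(words)

print(counts.most_common(1))  # [('blockchain', 3)]
```

A dictionary of counts like this can also be passed to `WordCloud.generate_from_frequencies` if you want full control over the tokenization.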


### 4.2 Access tweet stream (e.g. real-time tweets) through tweepy package

- Please read
https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/
- You'll need to apply for authentication keys and tokens.