# <center>Web Scraping by API </center>

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

Packages used in this notebook:

- snscrape: scrapes tweets
- tika: parses PDF files

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for accessing their data. There are two common types:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- You need to read the API documentation to figure out how to access the data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to `yogurt`, i.e. searches for `yogurt`
        - `&`: separates multiple parameters
        - `r=json`: the result is returned in JSON format
    - You can paste the above URL directly into your browser
    - Or issue API calls using `requests`
- You need to read the API documentation to understand how to specify parameters
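
To make the URL anatomy above concrete, here is a minimal sketch that splits the example URL into its endpoint and parameters using Python's standard `urllib.parse` module. Nothing is sent over the network, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://groceries.asda.com/api/items/search?keyword=yogurt&r=json"

parts = urlsplit(url)

# Everything before "?" is the API endpoint
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after "?" is the query string; "&" separates the name=value pairs.
# parse_qs returns a list of values per name, so keep the first value of each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(endpoint)
print(params)
```

The resulting `params` dictionary has the same shape as the one that can be passed to `requests.get(..., params=...)`.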

In [None]:
import requests
import json

keyword = 'yogurt'


url = "https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code == 200:

    # This API call returns a JSON object
    # r.json() parses it into a Python object
    result = r.json()
    print(json.dumps(result, indent=4))



In [None]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r = requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user',
#                  auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code == 200:

    # This API call returns a JSON object
    # r.json() parses it into a Python object
    print(json.dumps(r.json(), indent=4))
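
One reason to prefer a parameter dictionary over string concatenation is encoding: values that contain spaces or `&` must be escaped before they go into a URL. `requests` takes care of this for `params`. The sketch below illustrates the same idea offline with the standard library's `urlencode`. The search term `greek yogurt & honey` is a made-up value for illustration.

```python
from urllib.parse import urlencode

endpoint = "https://groceries.asda.com/api/items/search"

# Hypothetical search term containing spaces and an ampersand
parameters = {"keyword": "greek yogurt & honey", "r": "json"}

# Naive concatenation leaves the spaces and "&" unescaped, so the
# query string no longer says what we meant
naive_url = endpoint + "?keyword=" + parameters["keyword"] + "&r=json"

# urlencode escapes each value, so the query string stays unambiguous
safe_url = endpoint + "?" + urlencode(parameters)

print(naive_url)
print(safe_url)
```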



## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- Text only
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### In Python, parsed JSON is usually:
- **a dictionary** (from a JSON object) or 
- a **list of dictionaries** (from a JSON array of objects)
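
As a small illustration of the syntax rules above, this sketch parses a hand-written JSON string (the product data is invented) and shows which Python type each JSON construct becomes.

```python
import json

# A made-up JSON document: an object that holds an array of objects
text = '{"keyword": "yogurt", "items": [{"name": "Plain", "price": 1.5, "inStock": true, "brand": null}]}'

data = json.loads(text)

# Curly braces hold objects, which become Python dictionaries
print(type(data).__name__)

# Square brackets hold arrays, which become Python lists
print(type(data["items"]).__name__)

# JSON true/false/null become Python True/False/None
first = data["items"][0]
print(first["price"], first["inStock"], first["brand"])
```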

### Useful JSON functions
- `dumps`: serialize a Python object to a JSON string
- `dump`: serialize a Python object to a JSON file
- `loads`: parse a JSON string into a Python object
- `load`: parse a JSON file into a Python object
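
The four functions come in two pairs: one for strings and one for files. The sketch below round-trips a small invented list through both pairs. The temporary file path is only for illustration, so the example does not overwrite anything in the working directory.

```python
import json
import os
import tempfile

items = [{"name": "Plain yogurt", "price": 1.5}]  # invented sample data

# dumps / loads: Python object <-> JSON string
s = json.dumps(items, indent=4)
from_string = json.loads(s)

# dump / load: Python object <-> JSON file
path = os.path.join(tempfile.mkdtemp(), "items_demo.json")
with open(path, "w") as f:
    json.dump(items, f)
with open(path) as f:
    from_file = json.load(f)

# Both round trips give back data equal to the original
print(from_string == items, from_file == items)
```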

In [None]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r = requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    #print(result)
    df = pd.DataFrame(result["items"])
    df.head()
    

In [None]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to a string
#result["items"][0:2]

s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load from a string
items = json.loads(s)
items

# save to file (the with block closes the file automatically)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
items[0]

## 4. Parse PDF Files
- Many Python packages are available for parsing PDF files:
  - PDFMiner: extracts information from PDF documents. It can report the exact location of text on a page, as well as other information such as fonts or lines. 
  - PyPDF2: a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. 
  - Tabula-py: reads tables from a PDF and converts them into pandas DataFrames. 
  - Tika: a Python port of the Apache Tika library (https://github.com/chrismattmann/tika-python)
- For a detailed analysis, see https://towardsdatascience.com/python-for-pdf-ef0fac2808b0

In [None]:
! pip install tika

In [None]:
# 4.1. Parse PDF file using Tika

from tika import parser

In [None]:
# Parse a local pdf file

# replace 'Assignment_Python.pdf' by any pdf file you can find

parsed = parser.from_file('Assignment_Python.pdf')

# Print the metadata of the PDF file
print(parsed["metadata"])

In [None]:
# Print the text of the PDF file
print(parsed["content"])

In [None]:
# Parse the file to XHTML
# Sometimes, it's better to see the document structure through XHTML

parsed = parser.from_file('Assignment_Python.pdf', xmlContent=True)
print(parsed["content"])

## 5. Get Tweets

References: 
- https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: 

- The User object is no longer exposed by TwitterSearchScraper
- snscrape does not work with Python 3.9. See https://github.com/JustAnotherArchivist/snscrape/issues/111 for a fix. Python versions earlier than 3.9 should work
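
The cells that follow wrap the scraper's `get_items()` generator in `itertools.islice` so that only the first 100 results are fetched. The sketch below shows the same pattern on a stand-in generator. `fake_items` is hypothetical and not part of snscrape, so the example runs without network access.

```python
import itertools

produced = []  # records which items the generator was actually asked for

def fake_items():
    """Hypothetical stand-in for a scraper's endless get_items() generator."""
    for i in itertools.count():
        produced.append(i)
        yield {"id": i, "content": f"tweet {i}"}

# islice stops pulling from the generator after 5 items,
# so the remaining (infinite) results are never produced
first_five = list(itertools.islice(fake_items(), 5))

print(len(first_five))
print(len(produced))
```

A list of dictionaries such as `first_five` can be passed straight to `pd.DataFrame`, which is how the cells below build their tables.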



In [2]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import itertools


In [5]:
#  search by keywords + time

    
df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2020-10-31 until:2020-11-03"').get_items(), 100))

print(len(df))
df.head()

100


| | url | date | content | id | username | outlinks | outlinksss | tcooutlinks | tcooutlinksss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://twitter.com/zawphyowai199/status/13234... | 2020-11-02 23:59:55+00:00 | Synic Token Airdrop is now Live🚀💰🏆\n\nClick on... | 1323414568236806146 | zawphyowai199 | [https://t.me/synictoken_officialAirdropbot] | https://t.me/synictoken_officialAirdropbot | [https://t.co/O83hXYOniH] | https://t.co/O83hXYOniH |
| 1 | https://twitter.com/CryptoWatchBot/status/1323... | 2020-11-02 23:59:53+00:00 | @NEO_Blockchain, #NEO is the coin with the bes... | 1323414560204824576 | CryptoWatchBot | [] | | [] | |
| 2 | https://twitter.com/Rayinhosen/status/13234145... | 2020-11-02 23:59:46+00:00 | 📌 CPX Airdrop is Live, 🎁 Join to get Free 7 CP... | 1323414530769051648 | Rayinhosen | [https://t.me/CrypxieAirdrop_bot?start=r017400... | https://t.me/CrypxieAirdrop_bot?start=r0174001... | [https://t.co/dcfBIyNYi4] | https://t.co/dcfBIyNYi4 |
| 3 | https://twitter.com/coinmarketnet/status/13234... | 2020-11-02 23:59:43+00:00 | 📌 CPX Airdrop is Live, 🎁 Join to get Free 7 CP... | 1323414518651781120 | coinmarketnet | [https://t.me/CrypxieAirdrop_bot?start=r076622... | https://t.me/CrypxieAirdrop_bot?start=r0766228434 | [https://t.co/Ff0IdY5i1G] | https://t.co/Ff0IdY5i1G |
| 4 | https://twitter.com/Link_Errors/status/1323414... | 2020-11-02 23:59:42+00:00 | Yearnify Finance Airdrop is now Live🚀💰🏆\n\nCli... | 1323414512675016706 | Link_Errors | [https://t.me/YearnifyAirdropBot] | https://t.me/YearnifyAirdropBot | [https://t.co/6HpZOBOZzI] | https://t.co/6HpZOBOZzI |


In [6]:
# search by user

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    '"zawphyowai199"').get_items(), 100))

print(len(df))
df.head()

100


| | url | date | content | id | username | outlinks | outlinksss | tcooutlinks | tcooutlinksss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://twitter.com/zawphyowai199/status/14384... | 2021-09-16 09:13:26+00:00 | @MrDogillionaire @bubbatoken 0x84c3b2941E49E08... | 1438430818938998784 | zawphyowai199 | [] | | [] | |
| 1 | https://twitter.com/zawphyowai199/status/14384... | 2021-09-16 09:12:46+00:00 | @AlienGameBsc Thanks for the opportunity.\n\n0... | 1438430652198653958 | zawphyowai199 | [] | | [] | |
| 2 | https://twitter.com/zawphyowai199/status/14377... | 2021-09-14 13:31:39+00:00 | @CryptoNetwork22 @airdropinspect @AirdropDet @... | 1437771022997020678 | zawphyowai199 | [] | | [] | |
| 3 | https://twitter.com/zawphyowai199/status/14374... | 2021-09-13 15:40:45+00:00 | @solanaECT @mma728122 \n@zpaung26 \n@mst5792 \... | 1437441125229293570 | zawphyowai199 | [] | | [] | |
| 4 | https://twitter.com/zawphyowai199/status/14366... | 2021-09-11 10:07:48+00:00 | @NFTLegendsBSC Bsc wallet :\n0xf84f43eA76BE8d6... | 1436632560377819142 | zawphyowai199 | [] | | [] | |
