# Web scraping: headers, the Network tab and parsing an API URL
## Helpful links and resources
- [urllib.parse](https://docs.python.org/3/library/urllib.parse.html#) is a Python standard-library module that picks apart URLs
- [Session objects - requests library](https://docs.python-requests.org/en/master/user/advanced/#session-objects)

In [1]:
#import libraries
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests

## The Network tab
### Static data files
[Covid cases in the US - New York Times](https://www.nytimes.com/interactive/2021/us/covid-cases.html)

In [2]:
# get static data file
# Inspect > Network > XHR > data.json > Headers
nyt_url = "https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/data/pages/usa/data.json"
r = requests.get(nyt_url)

In [3]:
nyt_covid = r.json()

In [4]:
# nyt_covid   --- comment out after checking the data
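A request for a static data file can fail or return something that is not JSON, so it can help to check the status before calling `.json()`. The sketch below is one way to wrap that check. The helper name `fetch_json`, the 10-second timeout and the optional `session` argument are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_json(url, params=None, headers=None, timeout=10, session=None):
    """GET a URL and return the decoded JSON body.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently handing back an error page.
    """
    # Use the given session if there is one, otherwise the module-level API.
    client = session if session is not None else requests
    response = client.get(url, params=params, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

With this helper, the cells above could be written as `nyt_covid = fetch_json(nyt_url)`. Passing `headers={"User-Agent": "..."}` sends a custom header with the request.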

### "Secret" APIs
Shopping websites such as [Target](https://www.target.com) are good candidates for "secret" (undocumented) APIs.

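Goal: identify the prices and ratings of the first 24 results that appear when searching for an item such as paper cups. The cells below start from a captured `bbq grill` search and then switch the keyword to `camp tent`.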

#### Target's Search API

In [5]:
# search for an item with the Network tab open to identify which APIs you can use
url = "https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=ff457966e64d5e877fdbad070f276d18ecec4a01&channel=WEB&count=24&default_purchasability_filter=true&include_sponsored=true&keyword=bbq+grill&offset=0&page=%2Fs%2Fbbq+grill&platform=desktop&pricing_store_id=1122&scheduled_delivery_store_id=1122&store_ids=1122%2C321%2C1054%2C3265%2C2185&useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36&visitor_id=017A775ABC630201BA0F30D2CCBC25CA"

In [6]:
# parse the URL so it's easier to read
target_url = urlparse(url)

In [7]:
# check the parsed URL
target_url

ParseResult(scheme='https', netloc='redsky.target.com', path='/redsky_aggregations/v1/web/plp_search_v1', params='', query='key=ff457966e64d5e877fdbad070f276d18ecec4a01&channel=WEB&count=24&default_purchasability_filter=true&include_sponsored=true&keyword=bbq+grill&offset=0&page=%2Fs%2Fbbq+grill&platform=desktop&pricing_store_id=1122&scheduled_delivery_store_id=1122&store_ids=1122%2C321%2C1054%2C3265%2C2185&useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36&visitor_id=017A775ABC630201BA0F30D2CCBC25CA', fragment='')

In [8]:
# format the endpoint and parameters
endpoint = target_url[0] + '://' + target_url[1] + target_url[2]
params = {}
for parameter in target_url[4].split('&'):
    key_value = parameter.split('=')
    params[key_value[0]] = key_value[1]
print(endpoint)
print(params)

https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1
{'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01', 'channel': 'WEB', 'count': '24', 'default_purchasability_filter': 'true', 'include_sponsored': 'true', 'keyword': 'bbq+grill', 'offset': '0', 'page': '%2Fs%2Fbbq+grill', 'platform': 'desktop', 'pricing_store_id': '1122', 'scheduled_delivery_store_id': '1122', 'store_ids': '1122%2C321%2C1054%2C3265%2C2185', 'useragent': 'Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36', 'visitor_id': '017A775ABC630201BA0F30D2CCBC25CA'}

In [9]:
params.keys()

dict_keys(['key', 'channel', 'count', 'default_purchasability_filter', 'include_sponsored', 'keyword', 'offset', 'page', 'platform', 'pricing_store_id', 'scheduled_delivery_store_id', 'store_ids', 'useragent', 'visitor_id'])

In [10]:
params['keyword']

'bbq+grill'

In [11]:
# change something in the parameters (like keyword)
params['keyword'] = 'camp+tent'

In [12]:
params

{'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01',
 'channel': 'WEB',
 'count': '24',
 'default_purchasability_filter': 'true',
 'include_sponsored': 'true',
 'keyword': 'camp+tent',
 'offset': '0',
 'page': '%2Fs%2Fbbq+grill',
 'platform': 'desktop',
 'pricing_store_id': '1122',
 'scheduled_delivery_store_id': '1122',
 'store_ids': '1122%2C321%2C1054%2C3265%2C2185',
 'useragent': 'Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36',
 'visitor_id': '017A775ABC630201BA0F30D2CCBC25CA'}
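The values in this dictionary are still percent-encoded (`%2F`, `%2C`, `+`), because the query string was split by hand. When such a dictionary is passed to `requests.get(..., params=...)`, requests encodes the values a second time, so `camp+tent` is sent as `camp%2Btent`. A hedged alternative is to let `urllib.parse.parse_qsl` decode the query first. The helper name `split_api_url` and the shortened example URL (a subset of the parameters captured above) are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def split_api_url(url):
    """Split a URL into its endpoint and a dict of decoded query parameters."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl decodes percent-escapes and turns '+' into a space.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return endpoint, params


# Shortened version of the search URL, for illustration only.
example_url = (
    "https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1"
    "?channel=WEB&count=24&keyword=bbq+grill&page=%2Fs%2Fbbq+grill"
    "&store_ids=1122%2C321"
)
example_endpoint, example_params = split_api_url(example_url)

# Values are now plain text, so edits use ordinary spaces and slashes.
example_params["keyword"] = "camp tent"
example_params["page"] = "/s/camp tent"

print(example_endpoint)
print(example_params)
# urlencode shows the query string that would be sent for these params.
print(urlencode(example_params))
```

Note that the `page` parameter also embeds the search term, so this sketch updates it together with `keyword`. Whether the API actually requires the two to match is not something the outputs above confirm.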

In [13]:
# get request with endpoint and params
r = requests.get(endpoint, params=params)

In [14]:
# drill down the json file
target_json=r.json()
# target_json  -- comment out after checking the data

In [15]:
# drill down some more
# target_json.keys()
# target_json['data'].keys()
# target_json['data']['search'].keys()
len(target_json['data']['search']['products'])

24

In [40]:
# target_json['data']['search']['products'][0]

In [16]:
# target_json['data']['search']['products'][0].keys()
target_json['data']['search']['products'][0]['price']

{'current_retail': 79.99,
 'formatted_current_price': '$79.99',
 'formatted_current_price_type': 'reg'}

In [17]:
camptents = target_json['data']['search']['products']
for item in camptents:
    print(item['price']['current_retail'])

79.99
64.99
79.99
289.99
159.99
21.99
24.49
39.99
110.99
129.99
949.99
269.99
157.99
155.99
99.99
189.99
59.99
199.99
324.99
259.9
199.99
155.99
119.99
324.99
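The loop above assumes every product has a `price` block with a `current_retail` value. A more defensive sketch is shown below. It follows the same `data > search > products > price > current_retail` path seen in the outputs and records `None` when a price is missing. The helper name `collect_prices` and the small `sample` structure are assumptions for illustration; the sample reuses two prices from the output above.

```python
def collect_prices(search_json):
    """Return (position, price) pairs; price is None when it is missing."""
    products = search_json.get("data", {}).get("search", {}).get("products", [])
    rows = []
    for position, item in enumerate(products, start=1):
        # .get() avoids a KeyError if a product has no price information.
        price = item.get("price", {}).get("current_retail")
        rows.append((position, price))
    return rows


# Tiny stand-in for target_json with the same nesting as the real response.
sample = {
    "data": {
        "search": {
            "products": [
                {"price": {"current_retail": 79.99}},
                {"price": {"current_retail": 64.99}},
                {},  # a product without a price block
            ]
        }
    }
}

rows = collect_prices(sample)
print(rows)
```

On the real response this would be called as `collect_prices(target_json)`.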


#### Target's aggregation API

In [18]:
# parse the URL so it's easier to read
url2 = "https://redsky.target.com/redsky_aggregations/v1/web/plp_fulfillment_v1?key=ff457966e64d5e877fdbad070f276d18ecec4a01&tcins=49143327%2C15324510%2C78260419%2C76694785%2C78260420%2C54521249%2C54520776%2C80558045%2C78260418%2C80189790%2C81315774%2C82297021%2C76077445%2C76175715%2C80189779%2C54588573%2C78260422%2C82238797%2C79715181%2C76136537%2C82297024%2C76144854%2C54406954%2C76147591&store_id=1122&zip=94404&state=CA&latitude=37.560&longitude=-122.280&scheduled_delivery_store_id=1122"
aggregation_url = urlparse(url2)

In [19]:
# check the parsed URL
aggregation_url

ParseResult(scheme='https', netloc='redsky.target.com', path='/redsky_aggregations/v1/web/plp_fulfillment_v1', params='', query='key=ff457966e64d5e877fdbad070f276d18ecec4a01&tcins=49143327%2C15324510%2C78260419%2C76694785%2C78260420%2C54521249%2C54520776%2C80558045%2C78260418%2C80189790%2C81315774%2C82297021%2C76077445%2C76175715%2C80189779%2C54588573%2C78260422%2C82238797%2C79715181%2C76136537%2C82297024%2C76144854%2C54406954%2C76147591&store_id=1122&zip=94404&state=CA&latitude=37.560&longitude=-122.280&scheduled_delivery_store_id=1122', fragment='')

In [20]:
# format the endpoint and parameters
a_endpoint = aggregation_url[0] + '://' + aggregation_url[1] + aggregation_url[2]
a_params = {}
for parameter in aggregation_url[4].split('&'):
    key_value = parameter.split('=')
    a_params[key_value[0]] = key_value[1]
print(a_endpoint)
print(a_params)

https://redsky.target.com/redsky_aggregations/v1/web/plp_fulfillment_v1
{'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01', 'tcins': '49143327%2C15324510%2C78260419%2C76694785%2C78260420%2C54521249%2C54520776%2C80558045%2C78260418%2C80189790%2C81315774%2C82297021%2C76077445%2C76175715%2C80189779%2C54588573%2C78260422%2C82238797%2C79715181%2C76136537%2C82297024%2C76144854%2C54406954%2C76147591', 'store_id': '1122', 'zip': '94404', 'state': 'CA', 'latitude': '37.560', 'longitude': '-122.280', 'scheduled_delivery_store_id': '1122'}

In [21]:
# change something in the parameters (like tcins)
a_params.keys()
a_params['tcins'] = '49143327'

In [22]:
# get request with endpoint and params
response = requests.get(a_endpoint, params=a_params)
a_taget_json = response.json()
a_taget_json

{'data': {'product_summaries': [{'__typename': 'ProductSummary',
    'tcin': '49143327',
    'fulfillment': {'product_id': '49143327',
     'is_out_of_stock_in_all_store_locations': False,
     'shipping_options': {'availability_status': 'IN_STOCK',
      'loyalty_availability_status': 'IN_STOCK',
      'available_to_promise_quantity': 2459.0,
      'minimum_order_quantity': 1.0,
      'services': [{'shipping_method_id': 'STANDARD',
        'min_delivery_date': '2021-07-10',
        'max_delivery_date': '2021-07-10',
        'is_two_day_shipping': False,
        'is_base_shipping_method': True,
        'service_level_description': 'Standard Shipping',
        'shipping_method_short_description': 'Standard',
        'cutoff': '2021-07-08T19:00:00Z'}]},
     'store_options': [{'location_name': 'San Mateo Fashion Island',
       'location_address': '2220 Bridgepointe Pkwy,San Mateo,CA,94404-1569',
       'location_id': '1122',
       'search_response_store_type': 'PRIMARY',
       'order_

In [23]:
# drill down the json file
# a_taget_json.keys()
# a_taget_json['data'].keys()
len(a_taget_json['data']['product_summaries'])

1

In [24]:
# a_taget_json['data']['product_summaries'][0].keys()
a_taget_json['data']['product_summaries'][0]['fulfillment'].keys()
a_taget_json['data']['product_summaries'][0]['fulfillment']['scheduled_delivery']

{'availability_status': 'IN_STOCK'}
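The fulfillment endpoint takes a comma-separated list of product IDs in `tcins`. One way to connect the two APIs is to build that list from the search results. The sketch below assumes each search product carries a `tcin` field, which the search outputs above do not confirm, so check `products[0].keys()` first. The helper name `tcins_param` is hypothetical; the two IDs in the sample come from the captured URL.

```python
from urllib.parse import urlencode


def tcins_param(products, limit=24):
    """Join the product IDs of up to `limit` products with commas."""
    # Assumption: each product dict exposes its ID under the 'tcin' key.
    tcins = [str(product["tcin"]) for product in products if "tcin" in product]
    return ",".join(tcins[:limit])


# Stand-in for target_json['data']['search']['products'].
sample_products = [
    {"tcin": "49143327"},
    {"tcin": "15324510"},
    {"title": "product without an ID"},
]

a_params_example = {"store_id": "1122", "tcins": tcins_param(sample_products)}
print(a_params_example)
# The comma is percent-encoded automatically when the query string is built.
print(urlencode(a_params_example))
```

Because the value is a plain comma-separated string, it can be passed through `params=` without encoding the commas by hand.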

## Using sessions to log in
### Accessing password-protected pages
[Session objects - requests library](https://docs.python-requests.org/en/master/user/advanced/#session-objects)

I tried several paywalled sites but could not find one that worked well with the `config/config.json` login exercise. Instead of that assignment, I pulled the data behind a WSJ data story through Inspect > Network, as in WaPo's postal service example from the previous class.

Article URL: https://www.wsj.com/graphics/can-you-guess-how-many-hotel-chains-equal-the-value-of-airbnb/

It was good practice for me. However, if you know of any major web service that suits this homework, please let me know.

In [25]:
wsj = "https://wsjnewsgraphics.s3.amazonaws.com/projects/archibald/1RECjZL0QOp8OSrvsRxhWCr0M230lH_MKwALylJ1Issk-dev.json"

In [26]:
wsj_j = requests.get(wsj)
wsj_json = wsj_j.json()
# wsj_json --- comment out after checking the data

In [27]:
wsj_json.keys()
wsj_json['ipo'][0]
# len(wsj_json['ipo'])

{'date': '9-Dec-20',
 'dealValue': 3830,
 'MarketValues_PostDeal': 40966,
 'marketValue': 117664.17,
 'company': 'Airbnb',
 'nationality': 'United States',
 'Offer Price': 68}

In [28]:
ipos = wsj_json['ipo']
result = []
for ipo in ipos:
    result.append({ipo['company']: ipo['marketValue']})

In [29]:
result

[{'Airbnb': 117664.17},
 {'Snowflake': 86642.75},
 {'Lufax': 42757.27},
 {'DoorDash': 57568.89},
 {'Qualtrics': 14541.7},
 {'Wish': 16787.69},
 {'Unity': 35187.8},
 {'GoodRx': 22101.67},
 {'Affirm': 25112.15},
 {'Playtika': 12697.16},
 {'Dun & Bradstreet': 10927.1},
 {'McAfee': 8729.7},
 {'ZoomInfo': 22611.18},
 {'Root': 5459.88},
 {'GoHealth': 4618.57},
 {'Bentley Systems': 12677.35},
 {'Ozon': 11601.51},
 {'Chindata': 7151.2},
 {'Datto': 4061.01},
 {'Rackspace': 4517.88}]
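A list of one-key dictionaries is awkward to sort or look up. As a possible alternative, the sketch below ranks companies by `marketValue` as `(company, value)` pairs. The helper name `rank_by_market_value` is an assumption, and `sample_ipos` reuses four records from the output above with only the two fields the function needs.

```python
def rank_by_market_value(ipos):
    """Return (company, marketValue) pairs, largest market value first."""
    pairs = ((ipo["company"], ipo["marketValue"]) for ipo in ipos)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Four records copied from the result above.
sample_ipos = [
    {"company": "Airbnb", "marketValue": 117664.17},
    {"company": "Snowflake", "marketValue": 86642.75},
    {"company": "Lufax", "marketValue": 42757.27},
    {"company": "DoorDash", "marketValue": 57568.89},
]

ranking = rank_by_market_value(sample_ipos)
for company, value in ranking:
    print(company, value)
```

On the full data this would be called as `rank_by_market_value(wsj_json['ipo'])`, and `dict(ranking)` gives a single company-to-value mapping.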

In [30]:
# open up a session so that your login credentials are saved

In [31]:
# load in config file with passwords

In [32]:
# check the website for the login parameters

In [33]:
# post the payload to the site's login endpoint to log in

In [34]:
# check credentials to see if successful

In [35]:
# look at an example page to get you started with a query

In [36]:
# create a new post object from the example

In [37]:
# post request for the data

In [38]:
# check to see what is returned
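The empty cells above outline a session-based login. The sketch below fills in that outline under stated assumptions: the login URL `https://example.com/login`, the form field names `username` and `password`, and the layout of the credentials file are all hypothetical and must be replaced with what the Network tab shows when the real login form is submitted. The request is only prepared here, not sent, so the payload can be inspected safely.

```python
import json

import requests

# Hypothetical endpoint: copy the real one from the Network tab.
LOGIN_URL = "https://example.com/login"


def load_credentials(path):
    """Read a JSON file assumed to look like {"username": ..., "password": ...}."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config["username"], config["password"]


def build_login_request(session, username, password):
    """Prepare (but do not send) the login POST so it can be inspected."""
    # Hypothetical field names: use the ones the site's login form sends.
    payload = {"username": username, "password": password}
    request = requests.Request("POST", LOGIN_URL, data=payload)
    return session.prepare_request(request)


def login(session, username, password, timeout=10):
    """Send the login request; cookies set by the site stay on the session."""
    prepared = build_login_request(session, username, password)
    response = session.send(prepared, timeout=timeout)
    response.raise_for_status()
    return response


# Open a session so headers and cookies persist across requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (tutorial example)"})

prepared = build_login_request(session, "my_user", "my_password")
print(prepared.method, prepared.url)
print(prepared.body)
```

After a successful `login(session, *load_credentials("config/config.json"))`, later calls such as `session.get(...)` or `session.post(...)` reuse the same cookies. A 200 status alone does not prove the login worked, so it is worth checking the returned page or `session.cookies` as well.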