# Web Scraping Lecture Part 2 

## Content
5. Interacting with JSON APIs
6. Interacting with other types of APIs
7. Preparing for the Project: Job Data Scraping

# 5. Interacting with JSON APIs


## What is an API?

API stands for “Application Programming Interface.” It is an interface or communication protocol between a client and a server intended to simplify the building of client-side software. It has been described as a “contract” between the client and the server, such that if the client makes a request in a specific format, it will always get a response in a specific format or initiate a defined action. An API may be for a web-based system, operating system, database system, computer hardware, or software library. (Source: Wikipedia)

Many websites have public APIs that provide data feeds in JSON or other formats.

## JSON Data Introduction

JSON, short for JavaScript Object Notation, has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text format like CSV.

For example, https://api.github.com/repos/pandas-dev/pandas/issues returns the last 30 GitHub issues for pandas. This URL is an API endpoint, and the data it returns is in JSON format. Most web APIs return JSON or XML.

Another example of JSON:

    {"ticker":"AAPL:US","return_code":0,"ttl":300,"disp_name":"Apple Inc","last_price":222.11,"price_precision":3.0,"time_of_last_updt":"2018-10-12","pct_chge_1D":3.57192993}


JSON data is often thought of as a nested dictionary in Python: JSON objects map to dictionaries and JSON arrays map to lists.
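The "nested dictionary" view can be made concrete with a small sketch. The sample string below is made up for illustration (it is not from any real feed); only `json.loads` from the standard library is used, and the loop reports which built-in Python type each JSON value becomes.

```python
import json

# Hypothetical sample containing one value of each JSON type.
sample = (
    '{"ticker": "AAPL:US", "last_price": 222.11, "ttl": 300, '
    '"active": true, "notes": null, '
    '"tags": ["tech", "nasdaq"], "meta": {"source": "demo"}}'
)

parsed = json.loads(sample)

# Map each key to the name of the Python type that json.loads produced.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
for key, name in type_names.items():
    print(f"{key:<12} -> {name}")
```

Nested values are reached by chaining keys and indexes, for example `parsed["meta"]["source"]` or `parsed["tags"][0]`.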

**json library**

There are several Python libraries for reading and writing JSON data. *json*, part of the standard library, is one of them. To convert a JSON string to a Python object, we use *json.loads*.



In [1]:
#Define a json string. 
obj="""
{"name":"Wes",
 "places_lived":["United States","Spain","China"],
 "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},
             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]
}
"""
obj

'\n{"name":"Wes",\n "places_lived":["United States","Spain","China"],\n "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},\n             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]\n}\n'

In [2]:
import json
result=json.loads(obj) # json.loads() converts a JSON string to a Python object (here, a dictionary).
result
# result can be treated like any other Python dictionary.

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'China'],
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Cisco']}]}

<font color='red'>Exercise:</font>
Can we convert result (as it is already a dictionary) into a DataFrame?

In [3]:
# Not directly: pd.DataFrame(result) raises a ValueError because the list values have different lengths (3 places lived, 2 siblings).

In [4]:
result['siblings']

[{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
 {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Cisco']}]

In [5]:
result['siblings'][0]

{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']}

In [6]:
result['siblings'][0]['pets']

['Zeus', 'Zuko']

In [7]:
result['siblings'][0]['pets'][0]

'Zeus'

In [8]:
# json.dumps() converts a Python object back to a JSON string.
asjson=json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "China"], "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Cisco"]}]}'
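The single-line string above is hard to read. A minimal sketch of two optional `json.dumps` arguments, `indent` and `sort_keys`, that make the output easier to inspect; the `record` dictionary is a shortened stand-in for `result`.

```python
import json

# Shortened stand-in for the `result` dictionary used above.
record = {"name": "Wes", "places_lived": ["United States", "Spain", "China"]}

# indent adds line breaks and indentation; sort_keys orders keys alphabetically.
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)

# Round trip: parsing the serialized text gives back an equal object.
round_trip = json.loads(pretty)
print(round_trip == record)
```

The formatting only changes whitespace and key order, so the round trip still yields an object equal to the original.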

In [9]:
import pandas as pd
places=pd.DataFrame(result['places_lived'])
places

Unnamed: 0,0
0,United States
1,Spain
2,China


In [10]:
import pandas as pd
siblings=pd.DataFrame(result["siblings"])
siblings

Unnamed: 0,name,age,pets
0,Scott,30,"[Zeus, Zuko]"
1,Katie,38,"[Sixes, Cisco]"


In [11]:
import pandas as pd
siblings=pd.DataFrame(result["siblings"][0])
siblings

Unnamed: 0,name,age,pets
0,Scott,30,Zeus
1,Scott,30,Zuko


In [12]:
import pandas as pd
siblings=pd.DataFrame(result["siblings"][0]['pets'])
siblings

Unnamed: 0,0
0,Zeus
1,Zuko
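The cells above convert one nested piece at a time. A minimal sketch of another option, assuming one row per sibling and pet pair is the table we want: flatten the nesting by hand first. The column names `person`, `sibling`, and `pet` are made up for illustration.

```python
import json

obj = """
{"name":"Wes",
 "places_lived":["United States","Spain","China"],
 "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},
             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]
}
"""
result = json.loads(obj)

# Build one flat dictionary per (sibling, pet) pair so every row has the same keys.
rows = [
    {"person": result["name"], "sibling": sib["name"], "age": sib["age"], "pet": pet}
    for sib in result["siblings"]
    for pet in sib["pets"]
]
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` has the same shape as `result["siblings"]`, which `pd.DataFrame` accepted in cell 10, so it can be passed to `pd.DataFrame` in the same way.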


## <font color='red'>**Exercise: Access the nested key ‘salary’ from the following JSON**</font>

Write code to print the value of salary.

Expected output:

7000

In [13]:
import json

sampleJson = """{ 
   "company":{ 
      "employee":{ 
         "name":"emma",
         "payble":{ 
            "salary":7000,
            "bonus":800
         }
      }
   }
}"""

data = json.loads(sampleJson)
print(data['company']['employee']['payble']['salary'])

7000


## <font color='red'>**Exercise:**</font>
Parse the following JSON to get all the values of the key ‘name’ within the array.

Expected output:

["name1", "name2"]

In [14]:
import json

sampleJson = """[ 
   { 
      "id":1,
      "name":"name1",
      "color":[ 
         "red",
         "green"
      ]
   },
   { 
      "id":2,
      "name":"name2",
      "color":[ 
         "pink",
         "yellow"
      ]
   }
]"""

data = json.loads(sampleJson)  # a list of dictionaries
dataList = [item.get('name') for item in data]  # collect the 'name' value of each item
print(dataList)

['name1', 'name2']


## Scraping JSON using the requests library
An easy way to access APIs from Python is the **requests** package. It sends a **GET** request to a web server and downloads the response for us: the HTML of a web page or, for an API like this one, JSON. GET is only one of several request types that requests can make; others include POST, PUT, and DELETE.

To find the last 30 GitHub issues for pandas on GitHub, we make an HTTP **GET** request.

In [15]:
import requests
url='https://api.github.com/repos/pandas-dev/pandas/issues'
# Asking a website or API for information is called making a request;
# that is what the requests library is designed to do.
resp=requests.get(url)
resp # resp is a Response object.

<Response [200]>

After running our request, we get a Response object. This object has a `status_code` attribute, which indicates whether the request succeeded.

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
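The rule of thumb about leading digits can be sketched with the standard library's `http.HTTPStatus` enum. The helper name `describe_status` and its label wording are made up for illustration; nothing here contacts a server.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not in the standard library's registry
    kind = classes.get(code // 100, "other")
    return f"{code} {phrase} ({kind})"

for code in (200, 404, 500):
    print(describe_status(code))
```

With requests, `resp.status_code` could be passed to this helper. The library also provides `resp.raise_for_status()`, which raises an exception for 4xx and 5xx responses.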

In [16]:
data=resp.json() # The Response object's json() method parses the JSON body into
                 # native Python objects; here it returns a list of dictionaries.
len(data)
# Each element of data is a dictionary holding the fields of one GitHub issue.

30

In [17]:
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/events',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/52430',
  'id': 1654728051,
  'node_id': 'PR_kwDOAA0YD85NoUjI',
  'number': 52430,
  'title': 'PERF: Series.to_numpy with float dtype and na_value=np.nan',
  'user': {'login': 'lukemanley',
   'id': 8519523,
   'node_id': 'MDQ6VXNlcjg1MTk1MjM=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/8519523?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/lukemanley',
   'html_url': 'https://github.com/lukemanley',
   'followers_url': 'https://api.github.com/users/lukemanley/followers',
   'following_url'

## Scraping JSON using the urllib library

Please find the last 30 GitHub issues for pandas on GitHub. 

In [18]:
from urllib.request import urlopen
import json
url='https://api.github.com/repos/pandas-dev/pandas/issues'
htmlfile = urlopen(url)
htmltext=htmlfile.read() # htmltext holds the raw response body as bytes containing JSON text
data = json.loads(htmltext) # json.loads() accepts str or bytes; here it returns a list of Python dictionaries.
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/52430/events',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/52430',
  'id': 1654728051,
  'node_id': 'PR_kwDOAA0YD85NoUjI',
  'number': 52430,
  'title': 'PERF: Series.to_numpy with float dtype and na_value=np.nan',
  'user': {'login': 'lukemanley',
   'id': 8519523,
   'node_id': 'MDQ6VXNlcjg1MTk1MjM=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/8519523?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/lukemanley',
   'html_url': 'https://github.com/lukemanley',
   'followers_url': 'https://api.github.com/users/lukemanley/followers',
   'following_url'
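The call above sends a bare URL. A minimal sketch of how query parameters and headers could be attached with the standard library, assuming the GitHub REST API's `state`, `per_page`, and `page` query parameters and its `application/vnd.github+json` media type. The function names and the `User-Agent` value are made up for illustration, and only `fetch_issues` touches the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.github.com/repos/pandas-dev/pandas/issues"

def build_request(state="open", per_page=30, page=1):
    """Build (but do not send) a GET request for one page of issues."""
    query = urlencode({"state": state, "per_page": per_page, "page": page})
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "web-scraping-lecture-demo",  # hypothetical client name
    }
    return Request(f"{BASE_URL}?{query}", headers=headers)

def fetch_issues(**kwargs):
    """Send the request and parse the JSON body (needs network access)."""
    with urlopen(build_request(**kwargs), timeout=10) as response:
        return json.loads(response.read())

# Inspect the request without sending it.
req = build_request(state="closed", per_page=5)
print(req.get_method(), req.full_url)
```

Calling `fetch_issues(state="closed", per_page=5)` would then return a list of dictionaries, like `data` above. Unauthenticated requests to this API are rate limited, so it is worth keeping `per_page` small while experimenting.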

**<font color='red'>Exercise</font>**

Can you name a few common keys and convert data to a DataFrame that contains the values of those keys? How many rows does it have?

**<font color='red'>Answer to the Exercise</font>**

In [19]:
issues=pd.DataFrame(data,columns=['number','title','labels','state', 'created_at'])
issues

Unnamed: 0,number,title,labels,state,created_at
0,52430,PERF: Series.to_numpy with float dtype and na_...,"[{'id': 8935311, 'node_id': 'MDU6TGFiZWw4OTM1M...",open,2023-04-05T00:03:05Z
1,52429,API/DEPR: dtype=(str|bytes) interpret as pyarrow,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open,2023-04-04T23:08:12Z
2,52428,REF: define reductions non-dynamically,[],open,2023-04-04T23:02:26Z
3,52427,BUG: describe with double[pyarrow] does not re...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open,2023-04-04T22:27:36Z
4,52426,API: Series/DataFrame from empty dict should h...,[],open,2023-04-04T22:18:16Z
5,52425,BUG: to_datetime crashes with float32[pyarrow]...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open,2023-04-04T21:06:26Z
6,52424,PERF: avoid is_dtype_equal,[],open,2023-04-04T20:57:36Z
7,52423,TYP/TST: tests/mypy complaints failing locally,[],open,2023-04-04T20:56:00Z
8,52422,BUG: merge with arrow and numpy dtypes raises,"[{'id': 13098779, 'node_id': 'MDU6TGFiZWwxMzA5...",open,2023-04-04T20:26:42Z
9,52421,PERF: lazify IO imports,[],open,2023-04-04T20:23:09Z
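The `labels` column above still holds lists of dictionaries. A minimal sketch of pulling out just the label names, assuming each label dictionary has a `name` key (the truncated output above only shows `id` and `node_id`). The two sample issues are hand-written for illustration and are not real API output.

```python
# Hand-written sample shaped like two entries of `data`; all values are illustrative.
sample_issues = [
    {
        "number": 1,
        "title": "Example bug report",
        "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "IO"}],
    },
    {"number": 2, "title": "Example refactor", "labels": []},
]

# Replace each list of label dictionaries with a comma-separated string of names.
rows = [
    {
        "number": issue["number"],
        "title": issue["title"],
        "label_names": ", ".join(label["name"] for label in issue["labels"]),
    }
    for issue in sample_issues
]
for row in rows:
    print(row)
```

The same comprehension could be run over the real `data` list, and the resulting `rows` passed to `pd.DataFrame` to get a flat table.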


# 6. Interacting with other types of APIs

### First: a failed example of stock price data scraping

At http://www.bloomberg.com/quote/SPX:IND, locate the HTML tags that contain the S&P 500 index price.

The following code might have worked a few years ago, but it no longer does. Why?

In [20]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url=urlopen('http://www.bloomberg.com/quote/SPX:IND')
soup1=BS(url,'html.parser')
#print(soup1)
#soup1 contains the HTML of the page. Next we write the code that extracts the data.
#Inspect the page source to find the tag that holds the value we need.
#Here the code looks for <span> tags whose class attribute is "priceText__1853e8a5".
#find_all() returns every matching tag; an empty list means nothing matched.
price_box=soup1.find_all('span',attrs={'class':"priceText__1853e8a5"})
print(price_box)
#Print the <p> tags to see what the server actually returned.
print(soup1.find_all('p'))

[]
[<p class="continue">To continue, please click the box below to let us know you're not a robot.</p>, <p class="info__text">Please make sure your browser supports JavaScript and cookies and that you are not
            blocking them from loading.
            For more information you can review our <a class="info__link" href="/notices/tos">Terms of
                Service</a> and <a class="info__link" href="/notices/tos">Cookie Policy</a>.</p>, <p class="info__text">For inquiries related to this message please <a class="info__link" href="/feedback">contact
            our support team</a> and provide the reference ID below.</p>]


Some big websites provide free online sources of marketing, finance, or other business data. Examples include Google Finance, Yahoo Finance, Indeed, Twitter, and LinkedIn.

Python offers simple remote data access to the API data these websites provide. Alternatively, we can install the packages designed by those websites in order to call their APIs.