# Web Scraping Lecture Part 2 

## Content
5. Interacting with JSON APIs
6. Interacting with other types of APIs
7. Preparing for the Project: Job Data Scraping

# 5. Interacting with JSON APIs


## What is an API?

API stands for “Application Programming Interface.” It is an interface or communication protocol between a client and a server intended to simplify the building of client-side software. It has been described as a “contract” between the client and the server, such that if the client makes a request in a specific format, it will always get a response in a specific format or initiate a defined action. An API may be for a web-based system, operating system, database system, computer hardware, or software library. (Source: Wikipedia)

Many websites have public APIs that provide data feeds in JSON or other formats.

## JSON Data Introduction

JSON, short for JavaScript Object Notation, has become one of the standard formats for sending data via HTTP requests between web browsers and other applications. It is a much more free-form data format than a tabular text format such as CSV.

For example, https://api.github.com/repos/pandas-dev/pandas/issues returns the last 30 GitHub issues for pandas. This URL is an API endpoint, and the data it returns is in JSON format. Most web APIs return JSON or XML.

Another example of JSON:

    {"ticker":"AAPL:US","return_code":0,"ttl":300,"disp_name":"Apple Inc","last_price":222.11,"price_precision":3.0,"time_of_last_updt":"2018-10-12","pct_chge_1D":3.57192993}

JSON data is often thought of as nested dictionaries and lists in Python.
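As a small illustration of how JSON maps onto Python types, the sketch below parses the quote record shown above with the standard-library `json` module and checks which Python types come back. The variable names are illustrative only.

```python
import json

# The quote record from the example above, written as a JSON string.
quote_json = (
    '{"ticker":"AAPL:US","return_code":0,"ttl":300,"disp_name":"Apple Inc",'
    '"last_price":222.11,"price_precision":3.0,'
    '"time_of_last_updt":"2018-10-12","pct_chge_1D":3.57192993}'
)

quote = json.loads(quote_json)  # a JSON object becomes a Python dict

print(type(quote).__name__)                # dict
print(quote["disp_name"])                  # Apple Inc
print(type(quote["ttl"]).__name__)         # int (JSON number without a fraction)
print(type(quote["last_price"]).__name__)  # float (JSON number with a fraction)
```

JSON arrays become Python lists in the same way, which is why nested JSON ends up as dictionaries and lists inside one another.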

**json library**

There are several Python libraries for reading and writing JSON data. *json* is one of those. To convert a JSON string to a Python object, we use *json.loads*.



In [1]:
# Define a JSON string.
obj="""
{"name":"Wes",
 "places_lived":["United States","Spain","China"],
 "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},
             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]
}
"""
obj

'\n{"name":"Wes",\n "places_lived":["United States","Spain","China"],\n "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},\n             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]\n}\n'

In [2]:
import json
result=json.loads(obj) # json.loads() converts a JSON string to a Python object (here, a dictionary).
result
# result can be treated as a dictionary in Python.

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'China'],
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Cisco']}]}

In [3]:
result['siblings']

[{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
 {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Cisco']}]

In [4]:
result['siblings'][0]

{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']}

In [5]:
result['siblings'][0]['pets']

['Zeus', 'Zuko']

In [6]:
result['siblings'][0]['pets'][0]

'Zeus'

In [7]:
# json.dumps() converts a Python object back to a JSON string.
asjson=json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "China"], "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Cisco"]}]}'
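The single-line string above is compact but hard to read. As a minimal sketch, `json.dumps` also accepts the `indent` and `sort_keys` arguments for a more readable layout; the `record` dictionary below is a shortened copy of the example data, used only for illustration.

```python
import json

# Shortened copy of the example record, for illustration only.
record = {"name": "Wes", "places_lived": ["United States", "Spain", "China"]}

compact = json.dumps(record)                           # single-line JSON string
pretty = json.dumps(record, indent=2, sort_keys=True)  # indented, keys in sorted order

print(pretty)

# Parsing the serialized string gives back an equal Python object.
assert json.loads(compact) == record
```

Both strings describe the same data; only the whitespace and key order differ.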

In [8]:
import pandas as pd
places=pd.DataFrame(result['places_lived']) # a list of strings becomes a one-column DataFrame
places

|   | 0 |
|---|---|
| 0 | United States |
| 1 | Spain |
| 2 | China |


In [9]:
siblings=pd.DataFrame(result["siblings"])
siblings

|   | name | age | pets |
|---|---|---|---|
| 0 | Scott | 30 | [Zeus, Zuko] |
| 1 | Katie | 38 | [Sixes, Cisco] |


In [10]:
siblings=pd.DataFrame(result["siblings"][0])
siblings

|   | name | age | pets |
|---|---|---|---|
| 0 | Scott | 30 | Zeus |
| 1 | Scott | 30 | Zuko |


In [11]:
siblings=pd.DataFrame(result["siblings"][0]['pets'])
siblings

|   | 0 |
|---|---|
| 0 | Zeus |
| 1 | Zuko |
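The cells above show that nested lists either stay inside a single cell or get spread over rows, depending on which level is passed to `pd.DataFrame`. One possible way to control this explicitly is to flatten the nested structure into plain records first. The sketch below does that with the standard library only; the `rows` name and the `pet` column are illustrative choices, not part of the original example.

```python
import json

obj = """
{"name":"Wes",
 "places_lived":["United States","Spain","China"],
 "siblings":[{"name":"Scott","age":30,"pets":["Zeus","Zuko"]},
             {"name":"Katie","age":38,"pets":["Sixes","Cisco"]}]
}
"""
result = json.loads(obj)

# Build one flat record per (sibling, pet) pair.
rows = [
    {"name": sibling["name"], "age": sibling["age"], "pet": pet}
    for sibling in result["siblings"]
    for pet in sibling["pets"]
]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give a table with one row per pet and no list-valued cells.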


## <font color='red'>**Exercise: Access the nested key ‘salary’ from the following JSON**</font>

Write code to print the value of salary.

Expected output:

7000

In [12]:
import json

sampleJson = """{ 
   "company":{ 
      "employee":{ 
         "name":"emma",
         "payble":{ 
            "salary":7000,
            "bonus":800
         }
      }
   }
}"""
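One possible solution, shown as a sketch: parse the string with `json.loads` and then index one key at a time. The JSON is repeated here in a slightly more compact layout so the block runs on its own.

```python
import json

sample_json = """{
   "company": {
      "employee": {
         "name": "emma",
         "payble": {"salary": 7000, "bonus": 800}
      }
   }
}"""

data = json.loads(sample_json)

# Walk down one key at a time: company -> employee -> payble -> salary.
salary = data["company"]["employee"]["payble"]["salary"]
print(salary)  # 7000
```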



## <font color='red'>**Exercise:**</font>
Parse the following JSON to get all the values of the key ‘name’ within the array.

Expected output:

["name1", "name2"]

In [13]:
[ 
   { 
      "id":1,
      "name":"name1",
      "color":[ 
         "red",
         "green"
      ]
   },
   { 
      "id":2,
      "name":"name2",
      "color":[ 
         "pink",
         "yellow"
      ]
   }
]

[{'id': 1, 'name': 'name1', 'color': ['red', 'green']},
 {'id': 2, 'name': 'name2', 'color': ['pink', 'yellow']}]
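One possible solution, shown as a sketch: the top-level JSON value is an array, so `json.loads` returns a list, and a list comprehension can collect the `name` value from each element. The JSON is repeated as a string so the block is self-contained.

```python
import json

sample_json = """[
   {"id": 1, "name": "name1", "color": ["red", "green"]},
   {"id": 2, "name": "name2", "color": ["pink", "yellow"]}
]"""

items = json.loads(sample_json)           # JSON array -> Python list of dicts
names = [item["name"] for item in items]  # pick the 'name' value from each dict

print(json.dumps(names))  # ["name1", "name2"]
```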

## Scraping JSON using the requests library
One easy-to-use way to access APIs from Python is the **requests** package. The requests library makes a **GET** request to a web server and downloads the response body for us: HTML for an ordinary web page, or JSON for an API endpoint like the one used here. GET is only one of several request types that requests can make.

To find the last 30 GitHub issues for pandas on GitHub, we make an HTTP **GET** request.

In [14]:
import requests
url='https://api.github.com/repos/pandas-dev/pandas/issues'
# Asking a website or API for information is called making a request,
# which is exactly what the requests library is designed to do.
resp=requests.get(url)
resp # resp is a Response object.

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully. 

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
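The rule of thumb above can be written down as a small helper. This is only a sketch: `describe_status` is a hypothetical function name, and the standard-library `http.HTTPStatus` enum is used just to look up the standard phrase for each code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a coarse label for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

With a real response, `resp.status_code` could be passed to such a helper. The requests library also provides `resp.raise_for_status()`, which raises an exception for 4xx and 5xx responses.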

In [15]:
data=resp.json() # The Response object's json() method parses the JSON body
                 # into native Python objects (here, a list of dictionaries).
len(data)
# Each element in data is a dictionary containing all the data found on a GitHub issue page.

30
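The length of 30 is a page of results rather than the full list of issues. The sketch below builds a URL with query parameters to ask for a different page size. It assumes the `state` and `per_page` parameters described in GitHub's REST API documentation, so check the current documentation before relying on them. The block only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode

base_url = "https://api.github.com/repos/pandas-dev/pandas/issues"

# Assumed query parameters (see GitHub's REST API documentation):
# 'state' filters issues by state, 'per_page' sets the number of results per page.
params = {"state": "open", "per_page": 5}

url = f"{base_url}?{urlencode(params)}"
print(url)
# https://api.github.com/repos/pandas-dev/pandas/issues?state=open&per_page=5
```

With requests, the same query string can be produced by passing the dictionary directly, as in `requests.get(base_url, params=params)`.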

In [16]:
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/events',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/57222',
  'id': 2116253943,
  'node_id': 'PR_kwDOAA0YD85l502b',
  'number': 57222,
  'user': {'login': 'Jorewin',
   'id': 56088851,
   'node_id': 'MDQ6VXNlcjU2MDg4ODUx',
   'avatar_url': 'https://avatars.githubusercontent.com/u/56088851?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/Jorewin',
   'html_url': 'https://github.com/Jorewin',
   'followers_url': 'https://api.github.com/users/Jorewin/followers',
   'following_url': 'https://api.github.com/users/Jorewin/following{/other_user}',
   'gists_url': 'h

## Scraping JSON using the urllib library

We can also fetch the last 30 GitHub issues for pandas with **urllib** from the Python standard library.

In [17]:
from urllib.request import urlopen
import json
url='https://api.github.com/repos/pandas-dev/pandas/issues'
htmlfile = urlopen(url)
htmltext=htmlfile.read() # htmltext holds the raw JSON response as bytes
data = json.loads(htmltext) # json.loads() converts the JSON text to a list of Python dictionaries.
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/57222/events',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/57222',
  'id': 2116253943,
  'node_id': 'PR_kwDOAA0YD85l502b',
  'number': 57222,
  'user': {'login': 'Jorewin',
   'id': 56088851,
   'node_id': 'MDQ6VXNlcjU2MDg4ODUx',
   'avatar_url': 'https://avatars.githubusercontent.com/u/56088851?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/Jorewin',
   'html_url': 'https://github.com/Jorewin',
   'followers_url': 'https://api.github.com/users/Jorewin/followers',
   'following_url': 'https://api.github.com/users/Jorewin/following{/other_user}',
   'gists_url': 'h
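Unlike `resp.json()` in requests, `urlopen(...).read()` returns raw bytes, so the parsing step is explicit. The sketch below uses a made-up bytes literal as a stand-in for the downloaded body, to show that `json.loads` accepts either the bytes themselves or the decoded text.

```python
import json

# Stand-in for the bytes returned by urlopen(url).read(); made-up sample data.
raw = b'[{"number": 1, "title": "Example issue", "state": "open"}]'

print(type(raw).__name__)  # bytes

# Option 1: decode explicitly, then parse the resulting str.
from_text = json.loads(raw.decode("utf-8"))

# Option 2: pass the bytes to json.loads directly (supported since Python 3.6).
from_bytes = json.loads(raw)

print(from_text == from_bytes)   # True
print(from_bytes[0]["title"])    # Example issue
```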

**<font color='red'>Exercise</font>**

Can you pick a few common keys and convert data to a DataFrame that contains the values of those keys? How many rows does it have?

In [18]:

issues=pd.DataFrame(data,columns=['number','title','labels','state', 'created_at'])
issues



|   | number | title | labels | state | created_at |
|---|---|---|---|---|---|
| 0 | 57222 | ENH: Add all warnings check to the assert_prod... | [] | open | 2024-02-03T05:30:15Z |
| 1 | 57221 | TST: Reduce parameterization of test_str_find_e2e | [{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD... | open | 2024-02-03T05:09:15Z |
| 2 | 57220 | Fix using "python" instead of "sys.executable" | [] | open | 2024-02-03T03:25:52Z |
| 3 | 57219 | BUG: validate_docstring uses main python execu... | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open | 2024-02-03T03:25:24Z |
| 4 | 57216 | ENH: In sort methods, accept sequence or dict ... | [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... | open | 2024-02-02T21:37:00Z |
| 5 | 57215 | BUG: Error when trying to use `pd.date_range` | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open | 2024-02-02T20:39:36Z |
| 6 | 57214 | ENH: pd.Series.str.isempty() to check for empt... | [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... | open | 2024-02-02T20:03:53Z |
| 7 | 57213 | BUG: to_numeric loses precision when convertin... | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open | 2024-02-02T18:34:30Z |
| 8 | 57212 | BUG: ensure_string_array might modify read-onl... | [] | open | 2024-02-02T13:51:24Z |
| 9 | 57209 | BUG: qcut bins error | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open | 2024-02-02T06:30:59Z |
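The `labels` column still holds nested lists of dictionaries. As a sketch of one way to simplify it, the block below keeps a few keys per issue and reduces each label to its name. The records in `issues_sample` are made up for illustration and only mimic the shape of the API response. The block also assumes each label dictionary has a `name` key, which is not visible in the truncated output above.

```python
# Made-up records that mimic the shape of the issues response (not real data).
issues_sample = [
    {"number": 1, "title": "Example bug report", "state": "open",
     "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "Docs"}]},
    {"number": 2, "title": "Example feature request", "state": "open",
     "labels": []},
]

wanted = ["number", "title", "state"]  # keys to keep for every issue

rows = []
for issue in issues_sample:
    row = {key: issue[key] for key in wanted}
    # Reduce each label dictionary to its 'name' value (assumed key).
    row["label_names"] = [label["name"] for label in issue["labels"]]
    rows.append(row)

for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can be passed to `pd.DataFrame` in the same way as `data` above.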


## Another Example

In [19]:
response = requests.get("https://jsonplaceholder.typicode.com/todos")
todos = json.loads(response.text)

In [20]:
todos[:10]

[{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False},
 {'userId': 1,
  'id': 2,
  'title': 'quis ut nam facilis et officia qui',
  'completed': False},
 {'userId': 1, 'id': 3, 'title': 'fugiat veniam minus', 'completed': False},
 {'userId': 1, 'id': 4, 'title': 'et porro tempora', 'completed': True},
 {'userId': 1,
  'id': 5,
  'title': 'laboriosam mollitia et enim quasi adipisci quia provident illum',
  'completed': False},
 {'userId': 1,
  'id': 6,
  'title': 'qui ullam ratione quibusdam voluptatem quia omnis',
  'completed': False},
 {'userId': 1,
  'id': 7,
  'title': 'illo expedita consequatur quia in',
  'completed': False},
 {'userId': 1,
  'id': 8,
  'title': 'quo adipisci enim quam ut ab',
  'completed': True},
 {'userId': 1,
  'id': 9,
  'title': 'molestiae perspiciatis ipsa',
  'completed': False},
 {'userId': 1,
  'id': 10,
  'title': 'illo est ratione doloremque quia maiores aut',
  'completed': True}]

In [21]:
# Map of userId to number of complete TODOs for that user
todos_by_user = {}

# Increment complete TODOs count for each user.
for todo in todos:
    if todo["completed"]:
        try:
            # Increment the existing user's count.
            todos_by_user[todo["userId"]] += 1
        except KeyError:
            # This user has not been seen. Set their count to 1.
            todos_by_user[todo["userId"]] = 1

# Create a sorted list of (userId, num_complete) pairs.
top_users = sorted(todos_by_user.items(), 
                   key=lambda x: x[1], reverse=True)

# Get the maximum number of complete TODOs.
max_complete = top_users[0][1]

# Create a list of all users who have completed
# the maximum number of TODOs.
users = []
for user, num_complete in top_users:
    if num_complete < max_complete:
        break
    users.append(str(user))

max_users = " and ".join(users)

In [22]:
s = "s" if len(users) > 1 else ""
print(f"user{s} {max_users} completed {max_complete} TODOs")

users 5 and 10 completed 12 TODOs
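The counting loop above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in the cells above, and `todos_sample` is a small made-up list so that the block runs without a network request.

```python
from collections import Counter

# Made-up sample with the same keys as the TODO records above.
todos_sample = [
    {"userId": 1, "id": 1, "title": "a", "completed": True},
    {"userId": 1, "id": 2, "title": "b", "completed": True},
    {"userId": 2, "id": 3, "title": "c", "completed": True},
    {"userId": 2, "id": 4, "title": "d", "completed": False},
    {"userId": 3, "id": 5, "title": "e", "completed": True},
    {"userId": 3, "id": 6, "title": "f", "completed": True},
]

# Count completed TODOs per user in one pass.
completed_counts = Counter(
    todo["userId"] for todo in todos_sample if todo["completed"]
)

# Find the highest count and every user who reached it.
max_complete = max(completed_counts.values())
users = sorted(u for u, n in completed_counts.items() if n == max_complete)

s = "s" if len(users) > 1 else ""
print(f"user{s} {' and '.join(map(str, users))} completed {max_complete} TODOs")
# users 1 and 3 completed 2 TODOs
```

`Counter` removes the need for the `try`/`except KeyError` block, because missing keys start at zero.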
