# Web Scraping 

This part covers scraping data from the web. Web data may come as HTML or XML pages, or through an API. Web scraping is the practice of using libraries to sift through a web page and gather the data you need in the format most useful to you, while preserving the structure of the data.

There are several ways to extract information from the web. Using an API is probably the best way to extract data from a website. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API, and in those cases we need to scrape the HTML pages to fetch the information.

The Python libraries needed in this tutorial include
* urllib (part of the standard library)
* beautifulsoup4 (imported as bs4)
* lxml
* requests
* pandas

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Task 0

Get public holiday information

In [2]:
html = urlopen("https://www.timeanddate.com/calendar/custom.html?year=2018&country=29&cols=3&hol=4194331&df=1")

In [3]:
html

<http.client.HTTPResponse at 0x10578abf0>

In [4]:
bs_text = html.read().decode('utf-8')
bs_text

'<!DOCTYPE html><!--\nscripts and programs that download content transparent to the user are not allowed without permission\n--><html lang=en><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><title>Year 2018 Calendar – Australia</title><meta name=description content="Australia 2018 – Customized Calendar with holidays. Yearly calendar showing months for the year 2018. Calendars – online and print friendly – for any year and month"><meta name=robots content="max-image-preview:large"><meta property="og:image" content="https://www.timeanddate.com/scripts/calendarog.php?image=sydney1&calendar=CALENDAR&year=2018&country=Australia&abstract=Holidays%20and%20Observances"><meta property="og:image:width" content=1366><meta property="og:image:height" content=738><meta property="og:type" content=website><style>\n@font-face{font-family:iconfont;src:url("/common/fonts/iconfont.woff2?v8") format("woff2"),url("/common/fonts/iconfont.woff?v8") format("woff"),url("/common/fonts/icon

Option 1:

In [5]:
soup = BeautifulSoup(bs_text, "lxml")  # name the parser explicitly to avoid the "no parser specified" warning
soup.find(id='ch1')

<table class="ch1 cl1h cd2 ch" id="ch1"><thead><tr><th class="chh" colspan="1">Holidays and Observances:</th></tr></thead><tbody><tr><td><table><tbody><tr><td class="vt"><table class="cht lpad"><tbody><tr><td><span class="co1">1 Jan</span></td><td><a href="/holidays/australia/new-year-day" title="New Year's Day is the first day of the year in the Gregorian calendar used in Australia and many other countries.">New Year's Day</a></td></tr><tr><td><span class="co1">26 Jan</span></td><td><a href="/holidays/australia/australia-day" title="Australia Day is the Australian national day. It is celebrated on January 26 each year.">Australia Day</a></td></tr><tr><td><span class="co3">12 Feb</span></td><td><a href="/holidays/australia/hobart-regatta" title="The Royal Hobart Regatta is an annual public holiday in southern Tasmania, Australia, on the second Monday in February.">Royal Hobart Regatta (<span title="Tasmania: southern">Tasmania</span>)</a></td></tr><tr><td><span class="co3">16 Feb</span

In [6]:
xx = soup.find(id='ch1')
xx.find_all('tbody')[0].find_all('tbody')[0].find_all('tbody')[0].find_all('span')

[<span class="co1">1 Jan</span>,
 <span class="co1">26 Jan</span>,
 <span class="co3">12 Feb</span>,
 <span title="Tasmania: southern">Tasmania</span>,
 <span class="co3">16 Feb</span>,
 <span class="co3">17 Feb</span>,
 <span class="co3">5 Mar</span>,
 <span class="co3">12 Mar</span>,
 <span class="co3">12 Mar</span>,
 <span class="co3">12 Mar</span>,
 <span class="co3">12 Mar</span>,
 <span class="co3">30 Mar</span>,
 <span class="co1">30 Mar</span>,
 <span class="co3">31 Mar</span>,
 <span title="Australian Capital Territory">ACT</span>,
 <span title="New South Wales">NSW</span>,
 <span title="Northern Territory">NT</span>,
 <span title="Queensland">Qld</span>,
 <span title="South Australia">SA</span>,
 <span title="Victoria">Vic</span>,
 <span class="co3">1 Apr</span>,
 <span title="Australian Capital Territory">ACT</span>,
 <span title="New South Wales">NSW</span>,
 <span title="Queensland">Qld</span>,
 <span title="Victoria">Vic</span>,
 <span class="co1">2 Apr</span>,
 <span cla
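The `find_all('span')` call in Option 1 returns the date spans mixed with the region spans, which makes it hard to pair each date with its holiday name. One possible way around this is to iterate over the table rows instead. The sketch below runs on a small hand-written HTML fragment that only imitates the page structure; the fragment and the helper name `parse_holiday_rows` are assumptions made for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure of the holiday table (an assumption).
SAMPLE_HTML = """
<table id="ch1"><tbody><tr><td>
  <table class="cht lpad"><tbody>
    <tr><td><span class="co1">1 Jan</span></td><td><a href="#">New Year's Day</a></td></tr>
    <tr><td><span class="co3">12 Feb</span></td>
        <td><a href="#">Royal Hobart Regatta (<span title="Tasmania: southern">Tasmania</span>)</a></td></tr>
  </tbody></table>
</td></tr></tbody></table>
"""

def parse_holiday_rows(html):
    """Return (date, name) pairs, one per row of every table with class 'cht'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="cht"):
        for tr in table.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) >= 2:
                # The first cell holds the date, the second the holiday name.
                rows.append((cells[0].get_text().strip(), cells[1].get_text().strip()))
    return rows

holiday_rows = parse_holiday_rows(SAMPLE_HTML)
print(holiday_rows)
```

Working row by row keeps the region names inside the holiday name, where they belong, instead of listing them as separate spans.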

Option 2:

In [7]:
pattern = '</thead><tbody>(.*)</tbody></table></td></tr>'
pattern_text = re.findall(pattern, bs_text)
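
The `.*` in the pattern above is greedy: it matches as much text as possible, so the capture runs up to the last closing sequence on the line rather than the first. Note also that `.` does not match newlines unless `re.DOTALL` is passed. The toy string below is an assumption chosen to make the difference between greedy and non-greedy matching visible.

```python
import re

# Toy input with two <tbody> blocks on one line (an assumption for illustration).
html = "<tbody><tr>A</tr></tbody><p>x</p><tbody><tr>B</tr></tbody>"

# Greedy: captures from the first <tbody> to the LAST </tbody>.
greedy = re.findall(r"<tbody>(.*)</tbody>", html)

# Non-greedy: captures each block separately.
lazy = re.findall(r"<tbody>(.*?)</tbody>", html)

print(greedy)
print(lazy)
```

For the holiday page the greedy form happens to capture the block we want, but a layout change could silently alter what is matched, which is one reason parsing with BeautifulSoup alone (Option 1) tends to be more robust.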

In [8]:
pattern_text[0]

'<tr><td><table><tbody><tr><td class=vt><table class="cht lpad"><tbody><tr><td><span class="co1">1 Jan</span></td><td><a href="/holidays/australia/new-year-day" title="New Year&#39;s Day is the first day of the year in the Gregorian calendar used in Australia and many other countries.">New Year&#39;s Day</a></td></tr><tr><td><span class="co1">26 Jan</span></td><td><a href="/holidays/australia/australia-day" title="Australia Day is the Australian national day. It is celebrated on January 26 each year.">Australia Day</a></td></tr><tr><td><span class="co3">12 Feb</span></td><td><a href="/holidays/australia/hobart-regatta" title="The Royal Hobart Regatta is an annual public holiday in southern Tasmania, Australia, on the second Monday in February.">Royal Hobart Regatta (<span title="Tasmania: southern">Tasmania</span>)</a></td></tr><tr><td><span class="co3">16 Feb</span></td><td><a href="/holidays/australia/lunar-new-year" title="Lunar New Year marks the first day of the New Year in the Ch

In [9]:
bsobj = BeautifulSoup(pattern_text[0], "lxml")

In [10]:
bsobj

<html><body><tr><td><table><tbody><tr><td class="vt"><table class="cht lpad"><tbody><tr><td><span class="co1">1 Jan</span></td><td><a href="/holidays/australia/new-year-day" title="New Year's Day is the first day of the year in the Gregorian calendar used in Australia and many other countries.">New Year's Day</a></td></tr><tr><td><span class="co1">26 Jan</span></td><td><a href="/holidays/australia/australia-day" title="Australia Day is the Australian national day. It is celebrated on January 26 each year.">Australia Day</a></td></tr><tr><td><span class="co3">12 Feb</span></td><td><a href="/holidays/australia/hobart-regatta" title="The Royal Hobart Regatta is an annual public holiday in southern Tasmania, Australia, on the second Monday in February.">Royal Hobart Regatta (<span title="Tasmania: southern">Tasmania</span>)</a></td></tr><tr><td><span class="co3">16 Feb</span></td><td><a href="/holidays/australia/lunar-new-year" title="Lunar New Year marks the first day of the New Year in t

In [11]:
td1 = bsobj.find_all('tbody')[0].find_all('tr')[0].find_all('td')

In [12]:
td1[3].get_text()

'26 Jan'

In [13]:
len(td1)

104

In [14]:
holiday_list = []
year = 2018
for td in td1:
    result = td.find_all('table')
    if(len(result) != 0):
        trs = result[0].find_all('tbody')[0].find_all('tr')
        for tr in trs:
            td = tr.find_all('td')
            holiday = {}
            holiday['date'] = td[0].get_text()
            holiday['name'] = td[1].get_text()
            holiday['year'] = year
            holiday_list.append(holiday)

print(holiday_list)

[{'date': '1 Jan', 'name': "New Year's Day", 'year': 2018}, {'date': '26 Jan', 'name': 'Australia Day', 'year': 2018}, {'date': '12 Feb', 'name': 'Royal Hobart Regatta (Tasmania)', 'year': 2018}, {'date': '16 Feb', 'name': 'Lunar New Year (Christmas Island)', 'year': 2018}, {'date': '17 Feb', 'name': 'Lunar New Year Holiday (Day 2) (Christmas Island)', 'year': 2018}, {'date': '5 Mar', 'name': 'Labour Day (Western Australia)', 'year': 2018}, {'date': '12 Mar', 'name': 'Labour Day (Victoria)', 'year': 2018}, {'date': '12 Mar', 'name': 'Eight Hours Day (Tasmania)', 'year': 2018}, {'date': '12 Mar', 'name': 'Adelaide Cup (South Australia)', 'year': 2018}, {'date': '12 Mar', 'name': 'Canberra Day (Australian Capital Territory)', 'year': 2018}, {'date': '21 Mar', 'name': 'Harmony Day', 'year': 2018}, {'date': '30 Mar', 'name': 'Good Friday (Victoria)', 'year': 2018}, {'date': '30 Mar', 'name': 'Good Friday', 'year': 2018}, {'date': '31 Mar', 'name': 'Holy Saturday (ACT, Heard and McDonald Is

Let's wrap up everything

In [15]:
html = urlopen("https://www.timeanddate.com/calendar/custom.html?year=2019&country=29&cols=3&hol=4194331&df=1")
bs_text = html.read().decode('utf-8')
pattern2 = '</thead><tbody>(.*)</tbody></table></td></tr>'
str2 = re.findall(pattern2, bs_text)
bsobj = BeautifulSoup(str2[0], "lxml")
td1 = bsobj.find_all('tbody')[0].find_all('tr')[0].find_all('td')
holiday_list = []
year = 2019
for td in td1:
    result = td.find_all('table')
    if(len(result) != 0):
        tbo = result[0].find_all('tbody')[0].find_all('tr')
        for tr in tbo:
            td = tr.find_all('td')
            holiday = {}
            holiday['date'] = td[0].get_text()
            holiday['name'] = td[1].get_text()
            holiday['year'] = year
            holiday_list.append(holiday)
print(holiday_list)

[{'date': '1 Jan', 'name': "New Year's Day", 'year': 2019}, {'date': '26 Jan', 'name': 'Australia Day', 'year': 2019}, {'date': '28 Jan', 'name': 'Australia Day Observed (All except Christmas Island, Heard and McDonald Islands)', 'year': 2019}, {'date': '5 Feb', 'name': 'Lunar New Year (Christmas Island)', 'year': 2019}, {'date': '6 Feb', 'name': 'Lunar New Year Holiday (Day 2) (Christmas Island)', 'year': 2019}, {'date': '11 Feb', 'name': 'Royal Hobart Regatta (Tasmania)', 'year': 2019}, {'date': '4 Mar', 'name': 'Labour Day (Western Australia)', 'year': 2019}, {'date': '11 Mar', 'name': 'Labour Day (Victoria)', 'year': 2019}, {'date': '11 Mar', 'name': 'Eight Hours Day (Tasmania)', 'year': 2019}, {'date': '11 Mar', 'name': 'Adelaide Cup (South Australia)', 'year': 2019}, {'date': '11 Mar', 'name': 'Canberra Day (Australian Capital Territory)', 'year': 2019}, {'date': '18 Mar', 'name': 'Labour Day (Christmas Island)', 'year': 2019}, {'date': '21 Mar', 'name': 'Harmony Day', 'year': 20

Put it in a function:

In [16]:
def get_holiday_year(year):
    html = urlopen("https://www.timeanddate.com/calendar/custom.html?year="+str(year)+"&country=29&cols=3&hol=4194331&df=1")
    bs_text = html.read().decode('utf-8')
    pattern2 = '</thead><tbody>(.*)</tbody></table></td></tr>'
    str2 = re.findall(pattern2, bs_text)
    bsobj = BeautifulSoup(str2[0], "lxml")
    td1 = bsobj.find_all('tbody')[0].find_all('tr')[0].find_all('td')
    holiday_list = []
    for td in td1:
        result = td.find_all('table')
        if(len(result) != 0):
            tbo = result[0].find_all('tbody')[0].find_all('tr')
            for tr in tbo:
                td = tr.find_all('td')
                holiday = {}
                holiday['date'] = td[0].get_text()
                holiday['name'] = td[1].get_text()
                holiday['year'] = year
                holiday_list.append(holiday)
    return(holiday_list)
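
As a small variation, the query string can be built with `urllib.parse.urlencode` instead of string concatenation. This is only a sketch: the helper name `build_calendar_url` is hypothetical, and the parameter values are copied from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.timeanddate.com/calendar/custom.html"

def build_calendar_url(year, country=29):
    """Build the calendar URL for a given year (hypothetical helper)."""
    # Parameter names and values are taken from the URL used in this tutorial.
    params = {"year": year, "country": country, "cols": 3, "hol": 4194331, "df": 1}
    return BASE_URL + "?" + urlencode(params)

print(build_calendar_url(2018))
```

Keeping the parameters in a dictionary makes it easier to change one of them, for example the country code, without editing a long string.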

Get any years that you want:

In [17]:
holidays = []
for year in range(1990, 2021):
    year_holidays = get_holiday_year(year)  # fetch each year only once
    print(year_holidays)
    holidays = holidays + year_holidays
print(holidays)

[{'date': '1 Jan', 'name': "New Year's Day", 'year': 1990}, {'date': '29 Jan', 'name': 'Australia Day', 'year': 1990}, {'date': '12 Feb', 'name': 'Royal Hobart Regatta (Tasmania)', 'year': 1990}, {'date': '5 Mar', 'name': 'Labour Day (Western Australia)', 'year': 1990}, {'date': '5 Mar', 'name': 'Eight Hours Day (Tasmania)', 'year': 1990}, {'date': '12 Mar', 'name': 'Labour Day (Victoria)', 'year': 1990}, {'date': '19 Mar', 'name': 'Canberra Day (Australian Capital Territory)', 'year': 1990}, {'date': '6 Apr', 'name': 'Self Determination Day (Cocos and Keeling Islands)', 'year': 1990}, {'date': '13 Apr', 'name': 'Good Friday (Victoria)', 'year': 1990}, {'date': '13 Apr', 'name': 'Good Friday', 'year': 1990}, {'date': '14 Apr', 'name': 'Holy Saturday', 'year': 1990}, {'date': '15 Apr', 'name': 'Easter Sunday', 'year': 1990}, {'date': '16 Apr', 'name': 'Easter Monday', 'year': 1990}, {'date': '17 Apr', 'name': 'Easter Tuesday (Tasmania)', 'year': 1990}, {'date': '25 Apr', 'name': 'ANZAC 

In [18]:
import pandas as pd

In [19]:
df = pd.DataFrame(holidays)
df.to_csv("holidays.csv", index=False,encoding = 'utf-8')
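
Each record stores the day and month as text and the year separately, which is awkward for sorting or filtering. A possible follow-up step is to combine them into an ISO date. The sketch below assumes the site always uses three-letter English month abbreviations, as in the output above; `to_iso_date` is a hypothetical helper name.

```python
from datetime import date

# Assumed month abbreviations (only some of them appear in the output above).
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def to_iso_date(holiday):
    """Turn {'date': '26 Jan', 'year': 2018} into '2018-01-26'."""
    day_text, month_text = holiday["date"].split()
    return date(holiday["year"], MONTHS[month_text], int(day_text)).isoformat()

sample = {"date": "26 Jan", "name": "Australia Day", "year": 2018}
print(to_iso_date(sample))
```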

## Task 1 Extract a list of links on a Wikipedia page.

Instead of retrieving all the links existing in a Wikipedia article, we are interested in extracting links that point to other article pages. If you look at the source code of the following page 
```
https://en.wikipedia.org/wiki/Kevin_Bacon
```
in your browser, you will find that all these links have three things in common:
* They are in the *div* with *id* set to *bodyContent*
* The URLs do not contain colons
* The URLs begin with */wiki/*

We can use these rules to construct our search through the HTML page. 

First, use the urlopen() function to open the Wikipedia page for "Kevin Bacon":

In [20]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")

Then, find and print all the links. In order to finish this task, you need to
* find the *div* whose *id = "bodyContent"*
* find all the link tags whose href starts with "/wiki/" and does not contain ":". For example
```html
 see <a href="/wiki/Kevin_Bacon_(disambiguation)" class="mw-disambig" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
```

Hint: a regular expression is needed.
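
To see how such a regular expression behaves before applying it to a real page, it can be tried on a handful of hrefs. The list below is made up for illustration; the pattern itself is the one used in the solution cell of this task. The negative lookahead `(?!:)` rejects any href that contains a colon after `/wiki/`.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after it.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Made-up hrefs of the kinds found on a Wikipedia page (an assumption).
hrefs = [
    "/wiki/Philadelphia",
    "/wiki/Category:American_male_film_actors",
    "/wiki/Kevin_Bacon_(disambiguation)",
    "#cite_note-1",
    "/w/index.php?title=Kevin_Bacon&action=edit",
    "/wiki/Frost/Nixon_(film)",
]

article_links = [h for h in hrefs if ARTICLE_LINK.search(h)]
print(article_links)
```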

In [21]:
bsobj = BeautifulSoup(html, "lxml")
for link in bsobj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/National_Lampoon%27s_Animal_House
/wiki/Footloose_(1984_film)
/wiki/Diner_(1982_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/Sleepers_(film)
/wiki/Wild_Things_(film)
/wiki/The_Woodsman_(2004_film)
/wiki/Flatliners
/wiki/Crazy,_Stupid,_Love
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wi

## Task 2 Perform a random walk through a given webpage.
Suppose we want to find a random article in Wikipedia that is linked to "Kevin Bacon", in the spirit of the so-called "Six Degrees of Wikipedia". In other words, the task is to find two subjects linked by a chain containing no more than six subjects (including the two original subjects).

In [23]:
import datetime
import random

#random.seed(datetime.datetime.now())
random.seed(1)
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")

The random walk along the links proceeds as follows:
* Randomly choose a link from the list of retrieved links
* Print the article represented by the link
* Retrieve the list of links in that article
* Repeat the above steps until the number of retrieved articles reaches 5.

In [24]:
count = 0
while len(links) > 0 and count < 5:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    count = count + 1

/wiki/Guiding_Light
/wiki/Rosemary_(radio_series)
/wiki/Dash_(detergent)
/wiki/Kimberly-Clark
/wiki/Personal_care
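
The same walk can be tried offline on a small in-memory link graph, which is handy for checking the loop logic without sending requests to Wikipedia. The graph, the page names, and the function `random_walk` below are all made up for illustration.

```python
import random

# A tiny made-up link graph: page -> pages it links to.
LINK_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C", "/wiki/D"],
    "/wiki/C": ["/wiki/A"],
    "/wiki/D": [],  # a dead end: the walk stops here
}

def random_walk(start, max_steps, rng):
    """Follow randomly chosen links until max_steps is reached or no links remain."""
    path = []
    links = LINK_GRAPH.get(start, [])
    while links and len(path) < max_steps:
        current = rng.choice(links)
        path.append(current)
        links = LINK_GRAPH.get(current, [])
    return path

# A dedicated Random instance with a fixed seed makes the walk reproducible.
walk = random_walk("/wiki/A", 5, random.Random(1))
print(walk)
```

Passing the random generator as an argument, rather than seeding the global one, keeps the walk reproducible even if other code also draws random numbers.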


## Task 3 Crawl the Entire Wikipedia website

The general approach to an exhaustive site crawl is to start with the root, i.e., the home page of a website. Here, we will start with
```
https://en.wikipedia.org/
```
by retrieving all the links that appear on the home page, and then traversing each link recursively. However, the number of links is going to be very large, and a link can appear in many Wikipedia articles. Thus, we need to avoid crawling the same article or page repeatedly. To do so, we can keep a running set of visited pages for easy lookups and slightly update the getLinks() function.

In [25]:
pages = set()

# Note: add a terminating condition in your code, for example,
```python
    len(pages) < 10
```
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [26]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 10:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

In [27]:
getLinks("")

----------------
/wiki/Main_Page
----------------
/wiki/Wikipedia:Contents
----------------
/wiki/Portal:Current_events
----------------
/wiki/Special:Random
----------------
/wiki/Wikipedia:About
----------------
/wiki/Help:Contents
----------------
/wiki/Help:Introduction
----------------
/wiki/Wikipedia:Community_portal
----------------
/wiki/Special:RecentChanges
----------------
/wiki/Wikipedia:File_upload_wizard
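
The recursive `getLinks()` above goes one level deeper for every new page, so without the page limit it could eventually exceed Python's recursion limit. An alternative, sketched below, is an iterative crawl with an explicit queue and a visited set. To keep the sketch runnable offline, the link lookup is passed in as a function; the names `crawl` and `SAMPLE_GRAPH` and the graph itself are assumptions for illustration.

```python
from collections import deque

def crawl(start, get_links, max_pages=10):
    """Breadth-first crawl that visits each page at most once."""
    visited = set()
    order = []
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue  # already crawled via another link
        visited.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                queue.append(link)
    return order

# Made-up link graph standing in for real pages.
SAMPLE_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": ["/wiki/A"],
}

crawl_order = crawl("/wiki/A", lambda page: SAMPLE_GRAPH.get(page, []), max_pages=3)
print(crawl_order)
```

In a real crawl, `get_links` would fetch and parse the page, and a short pause between requests would be a courteous addition.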


## Task 4 Collect data across the Wikipedia site
One purpose of traversing all the links is to extract data. The best practice is to look at a few pages from the site and determine the patterns. By looking at a handful of Wikipedia pages, both article and non-article pages, the following patterns can be identified:
* All titles are under *h1* tags, and these are the only *h1* tags on the page. For example,
```html
<h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>
```
```html
<h1 id="firstHeading" class="firstHeading" lang="en">Main Page</h1>	
```
* All body text lives under the *div#bodyContent* tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using div#mw-content-text -> p.
* Edit links occur only on article pages. If they occur, they will be found in the *li#ca-edit tag*, under *li#ca-edit -> span -> a*

Now, the task is to further modify the getLinks() function to print the title, the first paragraph and the edit link. The content from each page should be separated by 
```python
print("----------------\n"+newPage)
```

In [28]:
pages = set()

##### Please also add a terminating condition in your code, for example,
```python
    len(pages) < 5
```
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [29]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id ="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 5:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

In [30]:
getLinks("") 

Main Page
<p><b><a href="/wiki/August_9" title="August 9">August 9</a></b>: <b><a href="/wiki/International_Day_of_the_World%27s_Indigenous_Peoples" title="International Day of the World's Indigenous Peoples">International Day of the World's Indigenous Peoples</a></b>; <b><a href="/wiki/National_Women%27s_Day" title="National Women's Day">National Women's Day</a></b> in South Africa (<a href="/wiki/1956" title="1956">1956</a>)
</p>
This page is missing something! No worries though!
----------------
/wiki/Main_Page
Main Page
<p><b><a href="/wiki/August_9" title="August 9">August 9</a></b>: <b><a href="/wiki/International_Day_of_the_World%27s_Indigenous_Peoples" title="International Day of the World's Indigenous Peoples">International Day of the World's Indigenous Peoples</a></b>; <b><a href="/wiki/National_Women%27s_Day" title="National Women's Day">National Women's Day</a></b> in South Africa (<a href="/wiki/1956" title="1956">1956</a>)
</p>
This page is missing something! No worries t
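
In the solution above, a single `try`/`except AttributeError` covers all three lookups, so one missing element hides the others. A possible alternative is to check each lookup for `None` separately. The sketch below runs on hand-written HTML that only imitates the patterns listed in this task; `extract_page_info` and the sample pages are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written page imitating the patterns described in this task (an assumption).
SAMPLE_PAGE = """
<html><body>
<h1 id="firstHeading">Kevin Bacon</h1>
<ul><li id="ca-edit"><span><a href="/wiki/Kevin_Bacon?action=edit">Edit</a></span></li></ul>
<div id="mw-content-text"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def extract_page_info(html):
    """Return title, first paragraph and edit link; missing parts become None."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    first_p = content.find("p") if content is not None else None
    edit_item = soup.find(id="ca-edit")
    edit_a = edit_item.find("a") if edit_item is not None else None
    return {
        "title": soup.h1.get_text() if soup.h1 is not None else None,
        "first_paragraph": first_p.get_text() if first_p is not None else None,
        "edit_link": edit_a.get("href") if edit_a is not None else None,
    }

info = extract_page_info(SAMPLE_PAGE)
partial = extract_page_info("<html><body><h1>Main Page</h1></body></html>")
print(info)
print(partial)
```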

*** 
## Task 5 API access 
In addition to HTML format, data is commonly found on the web through public APIs. We use the 'requests' package (http://docs.python-requests.org) to call APIs using Python. In the following example, we call a public API for collecting weather data. 


**You need to sign up for a free account to get your unique API key to use in the following code. Register at** http://api.openweathermap.org

In [31]:
#Now we  use requests to retrieve the web page with our data
import requests
url = 'http://api.openweathermap.org/data/2.5/forecast?id=524901&cnt=16&APPID=1499bcd50a6310a21f11b8de4fb653a5'
#write your APPID here#
response= requests.get(url)
response.json()

{'cod': '200',
 'message': 0,
 'cnt': 16,
 'list': [{'dt': 1691550000,
   'main': {'temp': 290.8,
    'feels_like': 290.58,
    'temp_min': 290.8,
    'temp_max': 291.61,
    'pressure': 1012,
    'sea_level': 1012,
    'grnd_level': 995,
    'humidity': 75,
    'temp_kf': -0.81},
   'weather': [{'id': 803,
     'main': 'Clouds',
     'description': 'broken clouds',
     'icon': '04d'}],
   'clouds': {'all': 61},
   'wind': {'speed': 2.82, 'deg': 327, 'gust': 5.28},
   'visibility': 10000,
   'pop': 0,
   'sys': {'pod': 'd'},
   'dt_txt': '2023-08-09 03:00:00'},
  {'dt': 1691560800,
   'main': {'temp': 292.27,
    'feels_like': 292.25,
    'temp_min': 292.27,
    'temp_max': 293.21,
    'pressure': 1012,
    'sea_level': 1012,
    'grnd_level': 995,
    'humidity': 77,
    'temp_kf': -0.94},
   'weather': [{'id': 500,
     'main': 'Rain',
     'description': 'light rain',
     'icon': '10d'}],
   'clouds': {'all': 61},
   'wind': {'speed': 3.26, 'deg': 347, 'gust': 5},
   'visibility':
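
The request above embeds the API key directly in the URL. One common alternative is to read the key from an environment variable and keep the query parameters in a dictionary. In the sketch below, the variable name `OPENWEATHER_API_KEY` and the helper `build_forecast_params` are assumptions; the parameter names and values are the ones from the URL above.

```python
import os

def build_forecast_params(city_id=524901, count=16):
    """Collect the query parameters, reading the key from the environment."""
    # OPENWEATHER_API_KEY is an assumed variable name; set it in your shell first.
    api_key = os.environ.get("OPENWEATHER_API_KEY", "YOUR_API_KEY")
    return {"id": city_id, "cnt": count, "APPID": api_key}

params = build_forecast_params()
print(sorted(params))

# The dictionary can then be passed to requests, which builds the query string:
# response = requests.get("http://api.openweathermap.org/data/2.5/forecast", params=params)
```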

The response object contains the response to the GET query. A successful request has a status code of 200. We need to parse the response body as JSON to extract the information.

In [32]:
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print (response.status_code)

200


In [33]:
# response.content holds the raw response body as bytes
print (type(response.content))

<class 'bytes'>


In [34]:
# response.json() parses the JSON body into Python objects (here a dict)
data = response.json()
print (type(data))

<class 'dict'>
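
What happens in this step can be reproduced offline with the standard `json` module: the raw bytes of a body are decoded and parsed into nested Python objects. The short body below is a made-up, trimmed imitation of the real response.

```python
import json

# A trimmed, made-up body imitating the shape of the forecast response.
raw_body = b'{"cod": "200", "cnt": 1, "list": [{"dt_txt": "2023-08-09 03:00:00"}]}'

parsed = json.loads(raw_body)  # bytes -> dict

print(type(raw_body), type(parsed))
print(parsed["list"][0]["dt_txt"])
```

Note that in the output shown earlier the `cod` field inside the body is the string `'200'`, whereas `response.status_code` is the integer `200`.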


In [35]:
data.keys()

dict_keys(['cod', 'message', 'cnt', 'list', 'city'])

In [36]:
data

{'cod': '200',
 'message': 0,
 'cnt': 16,
 'list': [{'dt': 1691550000,
   'main': {'temp': 290.8,
    'feels_like': 290.58,
    'temp_min': 290.8,
    'temp_max': 291.61,
    'pressure': 1012,
    'sea_level': 1012,
    'grnd_level': 995,
    'humidity': 75,
    'temp_kf': -0.81},
   'weather': [{'id': 803,
     'main': 'Clouds',
     'description': 'broken clouds',
     'icon': '04d'}],
   'clouds': {'all': 61},
   'wind': {'speed': 2.82, 'deg': 327, 'gust': 5.28},
   'visibility': 10000,
   'pop': 0,
   'sys': {'pod': 'd'},
   'dt_txt': '2023-08-09 03:00:00'},
  {'dt': 1691560800,
   'main': {'temp': 292.27,
    'feels_like': 292.25,
    'temp_min': 292.27,
    'temp_max': 293.21,
    'pressure': 1012,
    'sea_level': 1012,
    'grnd_level': 995,
    'humidity': 77,
    'temp_kf': -0.94},
   'weather': [{'id': 500,
     'main': 'Rain',
     'description': 'light rain',
     'icon': '10d'}],
   'clouds': {'all': 61},
   'wind': {'speed': 3.26, 'deg': 347, 'gust': 5},
   'visibility':

The keys explain the structure of the fetched data. Try displaying values for each element. In this example, the weather information exists in the 'list'. 

In [37]:
data['list'][15]

{'dt': 1691712000,
 'main': {'temp': 291.14,
  'feels_like': 290.72,
  'temp_min': 291.14,
  'temp_max': 291.14,
  'pressure': 1023,
  'sea_level': 1023,
  'grnd_level': 1006,
  'humidity': 66,
  'temp_kf': 0},
 'weather': [{'id': 803,
   'main': 'Clouds',
   'description': 'broken clouds',
   'icon': '04n'}],
 'clouds': {'all': 78},
 'wind': {'speed': 2.32, 'deg': 47, 'gust': 4.96},
 'visibility': 10000,
 'pop': 0,
 'sys': {'pod': 'n'},
 'dt_txt': '2023-08-11 00:00:00'}
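
Two fields in each entry usually need conversion before they are easy to read. The sketch below assumes that `dt` is a Unix timestamp in UTC and that temperatures are in kelvin, which is OpenWeatherMap's default when no `units` parameter is given. The values are copied from the entry above; `kelvin_to_celsius` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Values copied from the entry shown above.
entry = {"dt": 1691712000, "main": {"temp": 291.14}}

def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return round(kelvin - 273.15, 2)

forecast_time = datetime.fromtimestamp(entry["dt"], tz=timezone.utc)
temp_celsius = kelvin_to_celsius(entry["main"]["temp"])

print(forecast_time.isoformat(), temp_celsius)
```

The converted timestamp agrees with the `dt_txt` field of the same entry, which supports the assumption that both are expressed in UTC.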

The next step is to create a DataFrame with the weather information, as demonstrated below. You can select a subset to display or display the entire data.

In [38]:
from pandas import DataFrame
# data with the default column headers
weather_table_all= DataFrame(data['list'])
weather_table_all


Unnamed: 0,dt,main,weather,clouds,wind,visibility,pop,sys,dt_txt,rain
0,1691550000,"{'temp': 290.8, 'feels_like': 290.58, 'temp_mi...","[{'id': 803, 'main': 'Clouds', 'description': ...",{'all': 61},"{'speed': 2.82, 'deg': 327, 'gust': 5.28}",10000,0.0,{'pod': 'd'},2023-08-09 03:00:00,
1,1691560800,"{'temp': 292.27, 'feels_like': 292.25, 'temp_m...","[{'id': 500, 'main': 'Rain', 'description': 'l...",{'all': 61},"{'speed': 3.26, 'deg': 347, 'gust': 5}",10000,0.32,{'pod': 'd'},2023-08-09 06:00:00,{'3h': 0.18}
2,1691571600,"{'temp': 294.68, 'feels_like': 294.93, 'temp_m...","[{'id': 500, 'main': 'Rain', 'description': 'l...",{'all': 91},"{'speed': 3.03, 'deg': 340, 'gust': 3.65}",10000,0.52,{'pod': 'd'},2023-08-09 09:00:00,{'3h': 0.57}
3,1691582400,"{'temp': 295.4, 'feels_like': 295.59, 'temp_mi...","[{'id': 500, 'main': 'Rain', 'description': 'l...",{'all': 95},"{'speed': 2.36, 'deg': 302, 'gust': 3.79}",10000,0.88,{'pod': 'd'},2023-08-09 12:00:00,{'3h': 1.3}
4,1691593200,"{'temp': 296.07, 'feels_like': 295.98, 'temp_m...","[{'id': 500, 'main': 'Rain', 'description': 'l...",{'all': 54},"{'speed': 3.88, 'deg': 287, 'gust': 5.42}",10000,0.84,{'pod': 'd'},2023-08-09 15:00:00,{'3h': 0.42}
5,1691604000,"{'temp': 293.22, 'feels_like': 292.9, 'temp_mi...","[{'id': 802, 'main': 'Clouds', 'description': ...",{'all': 31},"{'speed': 3.36, 'deg': 286, 'gust': 8.39}",10000,0.62,{'pod': 'n'},2023-08-09 18:00:00,
6,1691614800,"{'temp': 290.95, 'feels_like': 290.56, 'temp_m...","[{'id': 800, 'main': 'Clear', 'description': '...",{'all': 2},"{'speed': 3.03, 'deg': 284, 'gust': 7.24}",10000,0.0,{'pod': 'n'},2023-08-09 21:00:00,
7,1691625600,"{'temp': 289.6, 'feels_like': 289.18, 'temp_mi...","[{'id': 800, 'main': 'Clear', 'description': '...",{'all': 2},"{'speed': 2.54, 'deg': 284, 'gust': 5.66}",10000,0.0,{'pod': 'n'},2023-08-10 00:00:00,
8,1691636400,"{'temp': 289.51, 'feels_like': 289.08, 'temp_m...","[{'id': 800, 'main': 'Clear', 'description': '...",{'all': 0},"{'speed': 2.58, 'deg': 294, 'gust': 6.18}",10000,0.0,{'pod': 'd'},2023-08-10 03:00:00,
9,1691647200,"{'temp': 293.88, 'feels_like': 293.47, 'temp_m...","[{'id': 800, 'main': 'Clear', 'description': '...",{'all': 2},"{'speed': 2.7, 'deg': 326, 'gust': 3.88}",10000,0.0,{'pod': 'd'},2023-08-10 06:00:00,


### Discussion: 

Further parsing is still required to get the table (DataFrame) in a flat shape. Now it's your turn: parse the weather data to generate a table.
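
As a starting point, one possible approach is to flatten each forecast entry into a single-level dictionary before building the DataFrame. The sketch below uses only the standard library; the sample entries are trimmed copies of the first two records shown above, and the helper name `flatten_entry` and the prefixed column names are assumptions.

```python
# Trimmed copies of the first two forecast entries shown earlier.
sample_list = [
    {"dt_txt": "2023-08-09 03:00:00",
     "main": {"temp": 290.8, "humidity": 75},
     "weather": [{"main": "Clouds", "description": "broken clouds"}],
     "wind": {"speed": 2.82, "deg": 327}},
    {"dt_txt": "2023-08-09 06:00:00",
     "main": {"temp": 292.27, "humidity": 77},
     "weather": [{"main": "Rain", "description": "light rain"}],
     "wind": {"speed": 3.26, "deg": 347},
     "rain": {"3h": 0.18}},
]

def flatten_entry(entry):
    """Flatten one forecast entry into a single-level dict."""
    row = {"dt_txt": entry["dt_txt"]}
    # Copy nested dicts with a prefix, e.g. main.temp -> main_temp.
    for section in ("main", "wind", "clouds"):
        for key, value in entry.get(section, {}).items():
            row[f"{section}_{key}"] = value
    # 'weather' is a list; keep the first item only.
    first = entry["weather"][0] if entry.get("weather") else {}
    row["weather_main"] = first.get("main")
    row["weather_description"] = first.get("description")
    # 'rain' is absent when no rain is forecast; treat that as 0.0.
    row["rain_3h"] = entry.get("rain", {}).get("3h", 0.0)
    return row

flat_rows = [flatten_entry(e) for e in sample_list]
print(flat_rows[1])
```

Passing `flat_rows` to `pandas.DataFrame` then yields one column per field. `pandas.json_normalize` is another option for expanding the nested dictionaries, although the `weather` list still needs separate handling.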

*Please note that materials used in this tutorial are partially from the book "Web Scraping with Python"*