## Sending Requests to a URL

In [1]:
import requests

In [2]:
# example: GitHub’s public timeline
r = requests.get('https://api.github.com/events')

In [3]:
# You can check which encoding Requests is using, and change it by assigning to r.encoding
r.encoding
# other encodings exist, for example 'ISO-8859-1'

'utf-8'

In [4]:
# We can read the content of the server’s response. 
r.text

'[{"id":"22157292574","type":"CreateEvent","actor":{"id":43370844,"login":"Horizon101011","display_login":"Horizon101011","gravatar_id":"","url":"https://api.github.com/users/Horizon101011","avatar_url":"https://avatars.githubusercontent.com/u/43370844?"},"repo":{"id":461947642,"name":"Horizon101011/AzurLaneAutoScript","url":"https://api.github.com/repos/Horizon101011/AzurLaneAutoScript"},"payload":{"ref":"emotion","ref_type":"branch","master_branch":"master","description":"碧蓝航线脚本 Azur Lane automation bot (CN/EN/JP/TW) | 无缝委托科研，全自动大世界，马上使用马上接管全部游戏玩法，计划作战模式？感觉不如Alas......好用","pusher_type":"user"},"public":true,"created_at":"2022-06-04T01:54:52Z"},{"id":"22157292548","type":"CreateEvent","actor":{"id":31182783,"login":"jinwookh","display_login":"jinwookh","gravatar_id":"","url":"https://api.github.com/users/jinwookh","avatar_url":"https://avatars.githubusercontent.com/u/31182783?"},"repo":{"id":499698772,"name":"Over-10-Study/SRE-with-java","url":"https://api.github.com/repos/Over-10-Stu

In [5]:
# You can also access the response body as bytes
r.content

b'[{"id":"22157292574","type":"CreateEvent","actor":{"id":43370844,"login":"Horizon101011","display_login":"Horizon101011","gravatar_id":"","url":"https://api.github.com/users/Horizon101011","avatar_url":"https://avatars.githubusercontent.com/u/43370844?"},"repo":{"id":461947642,"name":"Horizon101011/AzurLaneAutoScript","url":"https://api.github.com/repos/Horizon101011/AzurLaneAutoScript"},"payload":{"ref":"emotion","ref_type":"branch","master_branch":"master","description":"\xe7\xa2\xa7\xe8\x93\x9d\xe8\x88\xaa\xe7\xba\xbf\xe8\x84\x9a\xe6\x9c\xac Azur Lane automation bot (CN/EN/JP/TW) | \xe6\x97\xa0\xe7\xbc\x9d\xe5\xa7\x94\xe6\x89\x98\xe7\xa7\x91\xe7\xa0\x94\xef\xbc\x8c\xe5\x85\xa8\xe8\x87\xaa\xe5\x8a\xa8\xe5\xa4\xa7\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x8c\xe9\xa9\xac\xe4\xb8\x8a\xe4\xbd\xbf\xe7\x94\xa8\xe9\xa9\xac\xe4\xb8\x8a\xe6\x8e\xa5\xe7\xae\xa1\xe5\x85\xa8\xe9\x83\xa8\xe6\xb8\xb8\xe6\x88\x8f\xe7\x8e\xa9\xe6\xb3\x95\xef\xbc\x8c\xe8\xae\xa1\xe5\x88\x92\xe4\xbd\x9c\xe6\x88\x98\xe6\xa8\x

In [None]:
# There’s also a builtin JSON decoder
r.json()
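
The relationship between `r.content`, `r.text` and `r.json()` can be reproduced without a network call. The following is a minimal sketch, assuming a small hand-written payload (`raw_body` is made up for illustration and is not real API output): decoding the bytes gives the text, and parsing the text as JSON gives Python objects.

```python
import json

# Hypothetical response body, standing in for r.content (bytes)
raw_body = b'[{"id": "1", "type": "CreateEvent", "description": "\xe7\xa2\xa7\xe8\x93\x9d"}]'

# r.text is essentially the body decoded with r.encoding
text = raw_body.decode("utf-8")

# r.json() is essentially the decoded body parsed as JSON
events = json.loads(text)

print(type(raw_body).__name__, type(text).__name__, type(events).__name__)
print(events[0]["type"])
print(events[0]["description"])
```

The escaped byte sequences seen in `r.content` are the UTF-8 encoding of the non-ASCII characters that appear readable in `r.text`.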

In [7]:
from urllib.request import urlopen

In [8]:
url = "http://olympus.realpython.org/profiles/aphrodite"

# urlopen() returns an HTTPResponse object:
page = urlopen(url)

# read() returns the HTML body as bytes
html_bytes = page.read()
# decode() converts the bytes to text, here assuming UTF-8
html = html_bytes.decode("utf-8")
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



## String Methods

In [10]:
# index of the opening <title> tag
title_index = html.find('<title>')

In [11]:
# You don’t want the index of the <title> tag, though. You want the index of the title itself.
start_index = title_index + len('<title>')
# Now get the index of the closing </title> tag by passing the string "</title>" to .find()
end_index = html.find('</title>')
start_index, end_index

(21, 39)

In [13]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'

In [19]:
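# another example: on this page the opening tag is written "<title >" (with a space),
# so find('<title>') returns -1, start_index becomes -1 + 7 = 6,
# and the slice below no longer isolates the title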
url = "http://olympus.realpython.org/profiles/poseidon"
html = urlopen(url).read().decode("utf-8")
print(html)
start_index = html.find('<title>') + len('<title>')
print(html.find('<title>'))
print(start_index)
end_index = html.find('</title>')
title = html[start_index:end_index]
title

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>

-1
6


'\n<head>\n<title >Profile: Poseidon'
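
The Poseidon result shows why the return value of `.find()` should be checked before slicing. One possible guard is sketched below; `extract_between` is a hypothetical helper name, and the two sample strings are shortened copies of the pages above.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    tag_index = text.find(start_tag)
    if tag_index == -1:  # find() signals "not found" with -1
        return None
    start_index = tag_index + len(start_tag)
    # Search for the closing tag only after the opening tag
    end_index = text.find(end_tag, start_index)
    if end_index == -1:
        return None
    return text[start_index:end_index]


aphrodite = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>"
poseidon = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>"

print(extract_between(aphrodite, "<title>", "</title>"))  # Profile: Aphrodite
print(extract_between(poseidon, "<title>", "</title>"))   # None
```

Returning `None` makes the failure explicit instead of silently producing a wrong slice, but it still does not cope with the extra space; that is what the regular expressions in the next section are for.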

## Regular Expressions
Regular expressions are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module.

In [20]:
import re

The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

In [26]:
# examples
re.findall("ab*c", "ac")
re.findall("ab*c", "abc")
re.findall("ab*c", "abbc")
re.findall("ab*c", "acc")
re.findall("ab*c", "abcac")
re.findall("ab*c", "abdc")
# if no match is found, findall() returns an empty list
# instead of raising an error

[]
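
A notebook cell only displays the value of its last expression, so the cell above shows just the result for "abdc". A small sketch that prints the result for every sample string:

```python
import re

samples = ["ac", "abc", "abbc", "acc", "abcac", "abdc"]

# Map each sample string to the list of matches for "ab*c"
results = {s: re.findall("ab*c", s) for s in samples}

for s, matches in results.items():
    print(f"{s!r:8} -> {matches}")
```

Note that "acc" yields only `['ac']` (the trailing "c" is not part of the match) and "abcac" yields two separate matches.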

In [28]:
# Pattern matching is case sensitive. 
# If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE
re.findall("ab*c", "ABC")
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:



In [32]:
# examples
re.findall("a.c", "abc")
re.findall("a.c", "abbc")
re.findall("a.c", "ac")
re.findall("a.c", "acc")

['acc']

The pattern .* inside a regular expression stands for any character (except a newline, by default) repeated any number of times, including zero. For instance, "a.*c" can be used to find every substring that starts with "a" and ends with "c", regardless of which letter or letters are in between:

In [36]:
# examples
re.findall("a.*c", "abc")
re.findall("a.*c", "abbc")
re.findall("a.*c", "acabc")
re.findall("a.*c", "acc")

['acc']
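
To compare the two patterns side by side, the sketch below runs "a.c" and "a.*c" over the same sample strings and prints every result rather than only the last one.

```python
import re

samples = ["abc", "abbc", "ac", "acc", "acabc"]

# "a.c": exactly one character between "a" and "c"
dot = {s: re.findall("a.c", s) for s in samples}
# "a.*c": zero or more characters between "a" and "c" (greedy)
star = {s: re.findall("a.*c", s) for s in samples}

for s in samples:
    print(f"{s!r:8} a.c -> {dot[s]!s:10} a.*c -> {star[s]}")
```

For "acabc", "a.c" finds only `['abc']`, while "a.*c" returns the whole string because it extends to the last "c" it can reach.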

* Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because, instead of a list of strings, it returns a match object (re.Match) that stores the matched text, its position, and any captured groups. If nothing matches, it returns None.
* re.search() scans the string for the regular expression pattern and returns only the first occurrence.
* The details of the match object are irrelevant here. For now, just know that calling .group() on a match object returns the whole matched text, which in most cases is just what you want:

In [43]:
match_results = re.search("ab*c", "ABCabbbc", re.IGNORECASE)
# (start, end) indices of the first occurrence
match_results.span()

# the string that was passed into the function
match_results.string

# the matched text, equivalent to match_results.string[match_results.span()[0]:match_results.span()[1]]
match_results.group()

'ABC'
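
Because re.search() returns None when nothing matches, it is usually safer to test the result before calling .group(). A short sketch using the same pattern as above:

```python
import re

match = re.search("ab*c", "ABCabbbc", re.IGNORECASE)
if match is not None:
    print(match.span())   # (0, 3)
    print(match.group())  # ABC

# re.findall() returns every match; re.search() stops at the first one
all_matches = re.findall("ab*c", "ABCabbbc", re.IGNORECASE)
print(all_matches)        # ['ABC', 'abbbc']

# No match: the result is None, so calling .group() on it would raise AttributeError
missing = re.search("ab*c", "xyz")
print(missing)            # None
```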

re.sub(), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text.

In [44]:
string = "Everything is <replaced> if it's in <tags>."
re.sub('<.*>', 'ELEPHANTS', string)

'Everything is ELEPHANTS.'

* re.sub() uses the regular expression "<.*>" to find and replace everything between the first < and the last >, which spans from the beginning of <replaced> to the end of <tags>. This is because Python’s regular expressions are greedy by default, meaning they try to find the longest possible match when quantifiers like * are used.
* Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:



In [45]:
string = "Everything is <replaced> if it's in <tags>."
re.sub('<.*?>', 'ELEPHANTS', string)

"Everything is ELEPHANTS if it's in ELEPHANTS."

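The difference between greedy and non-greedy matching is easier to see when the matches themselves are listed. A small sketch with re.findall() on the same string:

```python
import re

string = "Everything is <replaced> if it's in <tags>."

greedy = re.findall("<.*>", string)   # longest possible match
lazy = re.findall("<.*?>", string)    # shortest possible matches

print(greedy)  # ["<replaced> if it's in <tags>"]
print(lazy)    # ['<replaced>', '<tags>']
```
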
### Extract Text From HTML With Regular Expressions

In [48]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
print(title)
title = re.sub("<.*?>", "", title)
print(title)

<TITLE >Profile: Dionysus</title  / >
Profile: Dionysus
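
As a variation on the two-step search-then-substitute approach, a capture group can return the title text directly. This is only a sketch: `extract_title` is a hypothetical helper, it uses `[^>]*` instead of `.*?` inside the tags, and the sample string is a shortened stand-in for the downloaded page.

```python
import re


def extract_title(html):
    """Return the text inside the first <title> element, or None if there is none."""
    # [^>]* tolerates spaces or stray characters before the closing ">"
    pattern = r"<title[^>]*>(.*?)</title[^>]*>"
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


sample = "<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>"
print(extract_title(sample))             # Profile: Dionysus
print(extract_title("<p>no title</p>"))  # None
```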


## Exercise: Scrape Data From a Website [http://olympus.realpython.org/profiles/dionysus]

Find the Name and the Favorite Color on the page.

In [64]:
url = 'http://olympus.realpython.org/profiles/dionysus'
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")
# labels whose values we want to extract: 'Name:', 'Favorite Color:'
target = {}
for string in ['Name:', 'Favorite Color:']:
    # the value starts right after the label ...
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    # ... and ends where the next HTML tag begins
    text_end_idx = html_text.find('<', text_start_idx)
    raw_text = html_text[text_start_idx:text_end_idx]
    # strip surrounding spaces, tabs and line breaks
    clean_text = raw_text.strip(' \r\n\t')
    print(clean_text)
    target[string] = clean_text
print(target)

Dionysus
Wine
{'Name:': 'Dionysus', 'Favorite Color:': 'Wine'}
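
The same exercise can also be approached with a regular expression. The sketch below is one possible variant, not the reference solution: `field_value` is a hypothetical helper, and `sample` is a hand-written snippet shaped like the profile page so that the code runs offline.

```python
import re


def field_value(html, label):
    """Return the text that follows label, up to the next tag, or None if the label is absent."""
    # re.escape() keeps characters in the label from being read as regex syntax
    match = re.search(re.escape(label) + r"([^<]*)", html)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical snippet standing in for the downloaded page
sample = "<h2>Name: Dionysus</h2>\n<br><br>\nFavorite Color: Wine\n<br><br>"

target = {label: field_value(sample, label) for label in ["Name:", "Favorite Color:"]}
print(target)  # {'Name:': 'Dionysus', 'Favorite Color:': 'Wine'}
```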


In [None]:
'''
\r carriage return: moves the cursor back to the start of the current line
\n new line
\t tab
'''

In [59]:
a = 'happy\tday'
print(a)

happy	day
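
These characters are invisible when printed, which is why the exercise strips them. A tiny sketch using repr() to make them visible before and after .strip(); the sample string is made up for illustration:

```python
raw_text = " \tDionysus\r\n"
print(repr(raw_text))    # ' \tDionysus\r\n'

# strip() removes any of the listed characters from both ends
clean_text = raw_text.strip(" \r\n\t")
print(repr(clean_text))  # 'Dionysus'
```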


## Use an HTML Parser for Web Scraping in Python: BeautifulSoup


In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [None]:
# some simple ways to navigate that data structure


In [None]:
# find all appearances of img


In [None]:
# find the tag
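
The empty cells above are left for practice. As a hedged illustration of the kind of calls they refer to, the sketch below parses a shortened copy of the Aphrodite page shown earlier (so it runs offline) and navigates it with attribute access and find_all().

```python
from bs4 import BeautifulSoup

# Shortened copy of the Aphrodite profile page from earlier in this notebook
html = """<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
</center>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first tag with that name
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # Profile: Aphrodite
print(soup.h2.string)     # Name: Aphrodite

# find_all() returns every matching tag; attributes are read like dict keys
images = soup.find_all("img")
print([img["src"] for img in images])  # ['/static/aphrodite.gif']
```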

## Exercise: 
Write a program that grabs the full HTML from the page http://olympus.realpython.org/profiles and prints every link found on it.

The output should look like this:
* http://olympus.realpython.org/profiles/aphrodite
* http://olympus.realpython.org/profiles/poseidon
* http://olympus.realpython.org/profiles/dionysus
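
One possible approach to this exercise is sketched below. It assumes the page lists its profiles as relative links such as `/profiles/aphrodite`; that markup is an assumption, so `sample` is a hand-written stand-in and `page_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def page_links(html, base_url):
    """Return the absolute URL of every <a href=...> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually have an href attribute
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Hypothetical stand-in for the downloaded profiles page
sample = """<h2><a href="/profiles/aphrodite">Aphrodite</a></h2>
<h2><a href="/profiles/poseidon">Poseidon</a></h2>
<h2><a href="/profiles/dionysus">Dionysus</a></h2>"""

base_url = "http://olympus.realpython.org/profiles"
links = page_links(sample, base_url)
for link in links:
    print(link)
```

To run it against the live page, `sample` could be replaced with `urlopen(base_url).read().decode("utf-8")`.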

## BeautifulSoup Continued

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# print(soup.get_text())
print(soup.prettify())

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. 

In [None]:
# tag
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

In [None]:
# name

# you can change a tag’s name
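
A minimal sketch of what this cell is meant to show, following the example in the Beautiful Soup documentation: every tag has a `.name`, and assigning to it renames the tag in the generated markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.name)  # b

# Renaming the tag changes the markup Beautiful Soup generates
tag.name = "blockquote"
print(tag)       # <blockquote class="boldest">Extremely bold</blockquote>
```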


In [None]:
# A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. 
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']

In [None]:
# see all attributes
tag.attrs

In [None]:
# change/add attributes


In [None]:
# delete an attribute


In [None]:
# find id?
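
The three empty cells above can be filled in along these lines. This is a sketch only: a tag's attributes behave like a dictionary, and `data-level` is a made-up attribute name used for illustration.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', "html.parser").b

tag["id"] = "verybold"     # change an existing attribute
tag["data-level"] = "1"    # add a new attribute (hypothetical name)
print(tag.attrs)           # {'id': 'verybold', 'data-level': '1'}

del tag["id"]              # delete an attribute
print(tag.get("id"))       # None, whereas tag["id"] would raise KeyError
print(tag.has_attr("id"))  # False
```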


In [None]:
# Multi-valued attributes
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [None]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
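
To see both behaviours at once, the sketch below puts a multi-valued attribute (`class`) and a single-valued one (`id`) on the same tag. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

print(soup.p["class"])  # ['body', 'strikeout'] (split into a list)
print(soup.p["id"])     # my id (left as a single string)

# get_attribute_list() always returns a list, whichever kind of attribute it is
print(soup.p.get_attribute_list("id"))  # ['my id']
```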

In [None]:
# When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')

rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)


In [None]:
# If you parse a document as XML (the 'xml' parser requires lxml to be installed), there are no multi-valued attributes:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

In [None]:
# searching the tree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
soup.find_all('b')

In [None]:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

In [None]:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

In [None]:
soup.find_all(["a", "b"])

In [None]:
soup.find_all("p")

In [None]:
soup.find_all("a")
soup.find_all("p", "title")
soup.find_all("p", class_="title")
soup.find_all(id="link2")

In [None]:
soup.find(string=re.compile("sisters"))
soup.find_all(href=re.compile("elsie"))

In [None]:
soup.find_all(id=True)
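
The cells above are shown without their outputs. The sketch below repeats several of the searches on the same `html_doc` and prints compact summaries (tag names, ids, hrefs) so the effect of each filter is easier to compare.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A regular expression is matched against tag names
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']

# A list matches any of the given tag names
print(len(soup.find_all(["a", "b"])))  # 4

# Filter by CSS class, by attribute value, or by attribute presence
title_text = soup.find_all("p", class_="title")[0].get_text()
elsie_links = [a["href"] for a in soup.find_all(href=re.compile("elsie"))]
ids = [tag["id"] for tag in soup.find_all(id=True)]
print(title_text)     # The Dormouse's story
print(elsie_links)    # ['http://example.com/elsie']
print(ids)            # ['link1', 'link2', 'link3']
```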

## Practice:
* send a request to bilibili
* parse the response with the HTML parser
* find all items with name div and class_='item', style=False
* print the link in href if it exists
* save the links into a list

In [None]:
# https://www.bilibili.com/
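
A possible starting point for the practice is sketched below. The real markup of the bilibili home page is not known here, so `sample` is a made-up stand-in and `collect_item_links` is a hypothetical helper; the `style` check is written out explicitly with has_attr() rather than passed as a keyword filter.

```python
from bs4 import BeautifulSoup


def collect_item_links(html):
    """Collect the href of the first link inside each <div class="item"> that has no style attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="item"):
        if div.has_attr("style"):  # skip items that carry an inline style
            continue
        anchor = div.find("a", href=True)
        if anchor is not None:     # only keep the link if it exists
            print(anchor["href"])
            links.append(anchor["href"])
    return links


# Hypothetical markup, only loosely modelled on a page of "item" blocks
sample = """
<div class="item"><a href="https://example.com/a">A</a></div>
<div class="item" style="display:none"><a href="https://example.com/hidden">H</a></div>
<div class="item"><span>no link</span></div>
<div class="other"><a href="https://example.com/other">O</a></div>
"""

links = collect_item_links(sample)
print(links)  # ['https://example.com/a']
```

For the live site, the HTML could come from `requests.get('https://www.bilibili.com/').text`; the page may require extra request headers or render its content with JavaScript, in which case the list can come back empty.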