## Sending a request to a URL

In [2]:
import requests

In [3]:
# example: GitHub’s public timeline
r = requests.get('https://api.github.com/events')

In [3]:
# You can find out which encoding Requests is using, and change it by assigning to r.encoding
r.encoding
# other encodings exist as well, for example 'ISO-8859-1'

'utf-8'
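To see why the encoding matters, the sketch below decodes the same bytes with two different codecs, using only the standard library. The sample string is an assumption chosen for illustration; it is not taken from the GitHub response. Assigning a value to `r.encoding` before reading `r.text` plays the same role as choosing the codec here.

```python
# Hypothetical sample text; any string with a non-ASCII character works.
raw = "café".encode("utf-8")  # bytes, as a server might send them

as_utf8 = raw.decode("utf-8")         # correct codec
as_latin1 = raw.decode("ISO-8859-1")  # wrong codec: no error, but garbled text

print(as_utf8)
print(as_latin1)
```

The wrong codec does not raise an error here; it silently produces different characters, which is why it is worth checking the encoding when the text looks garbled.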

In [4]:
# We can read the content of the server’s response. 
r.text



In [5]:
# You can also access the response body as bytes
r.content

b'[{"id":"22288241867","type":"CreateEvent","actor":{"id":15980475,"login":"toastkidjp","display_login":"toastkidjp","gravatar_id":"","url":"https://api.github.com/users/toastkidjp","avatar_url":"https://avatars.githubusercontent.com/u/15980475?"},"repo":{"id":100174267,"name":"toastkidjp/Yobidashi_kt","url":"https://api.github.com/repos/toastkidjp/Yobidashi_kt"},"payload":{"ref":"release/2_0_19","ref_type":"branch","master_branch":"main","description":"Web search app for Android, it\'s written with Kotlin.","pusher_type":"user"},"public":true,"created_at":"2022-06-12T00:57:49Z"},{"id":"22288241859","type":"WatchEvent","actor":{"id":107328937,"login":"LuaSolitaria","display_login":"LuaSolitaria","gravatar_id":"","url":"https://api.github.com/users/LuaSolitaria","avatar_url":"https://avatars.githubusercontent.com/u/107328937?"},"repo":{"id":123164659,"name":"Werisu/SACH","url":"https://api.github.com/repos/Werisu/SACH"},"payload":{"action":"started"},"public":true,"created_at":"2022-06-

In [5]:
# There’s also a built-in JSON decoder
r.json()

[{'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/8156884?',
   'display_login': 'willfish',
   'gravatar_id': '',
   'id': 8156884,
   'login': 'willfish',
   'url': 'https://api.github.com/users/willfish'},
  'created_at': '2022-06-15T08:28:57Z',
  'id': '22346728646',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/79591050?',
   'gravatar_id': '',
   'id': 79591050,
   'login': 'trade-tariff',
   'url': 'https://api.github.com/orgs/trade-tariff'},
  'payload': {'pusher_type': 'user',
   'ref': 'dependabot/npm_and_yarn/babel/plugin-transform-runtime-7.18.5',
   'ref_type': 'branch'},
  'public': True,
  'repo': {'id': 328968506,
   'name': 'trade-tariff/trade-tariff-frontend',
   'url': 'https://api.github.com/repos/trade-tariff/trade-tariff-frontend'},
  'type': 'DeleteEvent'},
 {'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/106757439?',
   'display_login': 'Tumijoseph',
   'gravatar_id': '',
   'id': 106757439,
   'login': 'Tumi
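Once the JSON is decoded, the result is an ordinary list of dictionaries. The sketch below shows how such a structure can be walked, using the standard `json` module on a hypothetical, shortened payload that only imitates the shape of the events above (the values are made up).

```python
import json

# Hypothetical payload shaped like one GitHub event; not real data.
body = b'[{"id": "1", "type": "CreateEvent", "repo": {"name": "user/repo"}}]'

events = json.loads(body)  # a list of dicts, the same kind of object r.json() gives here

for event in events:
    # Nested fields are reached with ordinary dictionary indexing.
    print(event["type"], event["repo"]["name"])
```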

In [6]:
from urllib.request import urlopen

In [7]:
url = "http://olympus.realpython.org/profiles/aphrodite"

# urlopen() returns an HTTPResponse object:
page = urlopen(url)

# read() returns the HTML body as bytes
html_bytes = page.read()
# decode() converts the bytes to a string using UTF-8
html = html_bytes.decode("utf-8")
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



## String methods

In [8]:
# title
title_index = html.find('<title>')

In [9]:
# You don’t want the index of the <title> tag, though. You want the index of the title itself.
start_index = title_index + len('<title>')
# Now get the index of the closing </title> tag by passing the string "</title>" to .find()
end_index = html.find('</title>')
start_index, end_index

(21, 39)

In [10]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'

In [12]:
# another example
# http://olympus.realpython.org/profiles/poseidon
url = "http://olympus.realpython.org/profiles/poseidon"
html = urlopen(url).read().decode("utf-8")
print(html)
start_index = html.find('<title>') + len('<title>')
print(html.find('<title>'))
print(start_index)
end_index = html.find('</title>')
title = html[start_index:end_index]
title

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>

-1
6


'\n<head>\n<title >Profile: Poseidon'
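The output above shows why plain string methods are fragile. The Poseidon page writes the tag as `<title >` with an extra space, so `html.find('<title>')` returns -1 and the slice starts in the wrong place. A minimal sketch of a guard against that case follows; the helper name `extract_title` and the inline sample strings are assumptions for illustration, used here instead of a live request.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    open_tag, close_tag = "<title>", "</title>"
    start = html.find(open_tag)
    if start == -1:  # .find() signals "not found" with -1
        return None
    start += len(open_tag)
    end = html.find(close_tag, start)
    if end == -1:
        return None
    return html[start:end]


good = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>"
odd = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>"

print(extract_title(good))
print(extract_title(odd))  # None, because the exact string "<title>" never appears
```

The guard only makes the failure explicit; it does not make the search tolerant of variations in the tag.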

## Regular Expressions
Regular expressions are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module.

In [13]:
import re

The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

In [14]:
# examples
re.findall("ab*c", "ac")
re.findall("ab*c", "abc")
re.findall("ab*c", "abbc")
re.findall("ab*c", "acc")
re.findall("ab*c", "abcac")
re.findall("ab*c", "abdc")
# if no match is found, findall() returns an empty list instead of raising an error

[]
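A notebook cell displays only the value of its last expression, so the cell above shows just the final result. The sketch below runs the same six calls and prints every result; the variable names are chosen for illustration.

```python
import re

pattern = "ab*c"
samples = ["ac", "abc", "abbc", "acc", "abcac", "abdc"]

# Map each sample string to the list of matches found in it.
results = {text: re.findall(pattern, text) for text in samples}

for text, matches in results.items():
    print(repr(text), "->", matches)
```

Note that "acc" yields `['ac']` (the second "c" is not part of the match) and "abcac" yields two separate matches.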

In [15]:
# Pattern matching is case sensitive. 
# If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE
re.findall("ab*c", "ABC")
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:



In [16]:
# examples
re.findall("a.c", "abc")
re.findall("a.c", "abbc")
re.findall("a.c", "ac")
re.findall("a.c", "acc")

['acc']

The pattern .* inside a regular expression stands for any character repeated any number of times. For instance, "a.*c" can be used to find every substring that starts with "a" and ends with "c", regardless of which letter—or letters—are in between:

In [17]:
# examples
re.findall("a.*c", "abc")
re.findall("a.*c", "abbc")
re.findall("a.*c", "acabc")
re.findall("a.*c", "acc")

['acc']
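The difference between `.` and `.*` is easier to see side by side. The sketch below applies both patterns to the sample strings used in the two cells above and prints each result; the dictionary names are assumptions for illustration.

```python
import re

samples = ["abc", "abbc", "ac", "acc", "acabc"]

# "a.c" needs exactly one character between "a" and "c".
single = {text: re.findall("a.c", text) for text in samples}
# "a.*c" allows any number of characters, including none, and is greedy.
any_run = {text: re.findall("a.*c", text) for text in samples}

for text in samples:
    print(repr(text), single[text], any_run[text])
```

For "acabc", `a.c` finds only the inner "abc", while the greedy `a.*c` stretches from the first "a" to the last "c".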

* Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because, instead of a list of strings, it returns a Match object that stores the matched text together with its position and any captured groups.
* re.search() scans the string for the regular expression pattern and returns only the first occurrence, or None if there is no match.
* The details of the Match object are mostly irrelevant here. For now, just know that calling .group() on it returns the entire text of that first match, which in most cases is just what you want:

In [18]:
match_results = re.search("ab*c", "ABCabbbc", re.IGNORECASE)
# returns a (start, end) tuple with the indices of the first occurrence
match_results.span()

# returns the string that was passed into the function
match_results.string

# returns the matched text, equivalent to
# match_results.string[match_results.span()[0]:match_results.span()[1]]
match_results.group()

'ABC'
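Two details of `re.search()` are worth a short sketch: parentheses in the pattern create groups that can be read back individually, and the function returns `None` when nothing matches. The pattern with parentheses below is a variation introduced here for illustration, not one used earlier in the notebook.

```python
import re

# Same search as above, but with parentheses around b* to capture that part.
match = re.search("a(b*)c", "ABCabbbc", re.IGNORECASE)

whole = match.group()   # entire first match
inner = match.group(1)  # text captured by the first pair of parentheses
position = match.span()
print(whole, inner, position)

# No match returns None, so test before calling .group().
no_match = re.search("xyz", "ABCabbbc")
if no_match is None:
    print("pattern not found")
```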

re.sub(), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text.

In [19]:
string = "Everything is <replaced> if it's in <tags>."
re.sub('<.*>', 'ELEPHANTS', string)

'Everything is ELEPHANTS.'

* re.sub() uses the regular expression "<.*>" to find and replace everything between the first < and last >, which spans from the beginning of <replaced> to the end of <tags>. This is because Python’s regular expressions are greedy, meaning they try to find the longest possible match when characters like * are used. (greedy search)
* Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:



In [20]:
string = "Everything is <replaced> if it's in <tags>."
re.sub('<.*?>', 'ELEPHANTS', string)

"Everything is ELEPHANTS if it's in ELEPHANTS."

### Extract Text From HTML With Regular Expressions

In [21]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
print(title)
title = re.sub("<.*?>", "", title)
print(title)

<TITLE >Profile: Dionysus</title  / >
Profile: Dionysus
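The same idea can be packaged as a small function. The sketch below is a variation on the cell above, not a copy of it: it captures the title text with a group instead of stripping tags with `re.sub()`, and it returns `None` when no title is found. The name `extract_title` and the inline sample strings are assumptions for illustration.

```python
import re

# Tolerates extra characters inside the opening and closing tags, in any letter case.
TITLE_PATTERN = re.compile("<title.*?>(.*?)</title.*?>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title, or None if the page has no title element."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


print(extract_title("<title >Profile: Poseidon</title>"))
print(extract_title("<TITLE >Profile: Dionysus</title  / >"))
print(extract_title("<p>no title here</p>"))
```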


## Exercise: Scrape Data From a Website [http://olympus.realpython.org/profiles/dionysus]

Find the Name and Favorite Color.

In [22]:
url = 'http://olympus.realpython.org/profiles/dionysus'
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")
# labels to look for; each value runs from the end of its label to the next tag
target = {}
for string in ['Name:', 'Favorite Color:']:
    # index just past the label
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    # the value ends where the next tag begins
    remain_html = html_text[text_start_idx:]
    ending_idx = remain_html.find('<')
    text_end_idx = ending_idx + text_start_idx
    raw_text = html_text[text_start_idx:text_end_idx]
    # strip surrounding whitespace, including \r, \n and \t
    clean_text = raw_text.strip(' \r\n\t')
    print(clean_text)
    target[string] = clean_text
print(target)

Dionysus
Wine
{'Name:': 'Dionysus', 'Favorite Color:': 'Wine'}
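The loop body above can also be written as a reusable function that reports a missing label instead of slicing from a bad index. This is a sketch under assumptions: the helper name `extract_field` is hypothetical, and the sample string only imitates the relevant part of the Dionysus page.

```python
def extract_field(html_text, label):
    """Return the text that follows `label` up to the next tag, or None if the label is absent."""
    start = html_text.find(label)
    if start == -1:
        return None
    start += len(label)
    end = html_text.find("<", start)  # search only after the label
    if end == -1:
        end = len(html_text)
    return html_text[start:end].strip()


# Hypothetical fragment shaped like the profile page.
sample = "<h2>Name: Dionysus</h2>\n<br><br>\nFavorite Color: Wine\n</center>"

fields = {label: extract_field(sample, label) for label in ["Name:", "Favorite Color:"]}
print(fields)
```

Passing the start position to `.find()` avoids building the intermediate `remain_html` slice.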


In [None]:
'''
\r carriage return: moves the cursor back to the start of the current line
\n newline
\t tab
'''

In [None]:
a = 'happy\tday'
print(a)

happy	day


## Use an HTML Parser for Web Scraping in Python: BeautifulSoup


In [23]:
# !pip3 install beautifulsoup4
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [26]:
print(soup.get_text())
print(soup.prettify())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine




<html>
 <head>
  <title>
   Profile: Dionysus
  </title>
 </head>
 <body bgcolor="yellow">
  <center>
   <br/>
   <br/>
   <img src="/static/dionysus.jpg"/>
   <h2>
    Name: Dionysus
   </h2>
   <img src="/static/grapes.png"/>
   <br/>
   <br/>
   Hometown: Mount Olympus
   <br/>
   <br/>
   Favorite animal: Leopard
   <br/>
   <br/>
   Favorite Color: Wine
  </center>
 </body>
</html>



In [30]:
# some simple ways to navigate that data structure
soup.title.name
soup.title.string
soup.title.parent.name

'head'

In [35]:
# find all appearances of 'img'
soup.find_all('img')
image1, image2 = soup.find_all('img')
image1.name
image1['src'], image2['src']

('/static/dionysus.jpg', '/static/grapes.png')

In [37]:
soup.find_all('img', src='/static/dionysus.jpg')[0]

<img src="/static/dionysus.jpg"/>

In [40]:
# soup.img returns the first <img> tag; find_all('img') returns all of them
soup.img
src = []
for i in soup.find_all('img'):
    src.append(i['src'])
    print(i['src'])

/static/dionysus.jpg
/static/grapes.png


## Exercise:
Write a program that grabs the full HTML from the page http://olympus.realpython.org/profiles and prints the URL of every link on that page.

The output should look like this:
* http://olympus.realpython.org/profiles/aphrodite
* http://olympus.realpython.org/profiles/poseidon
* http://olympus.realpython.org/profiles/dionysus

In [45]:
base_url = 'http://olympus.realpython.org'
page = urlopen(base_url+'/profiles')
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
all_link_url = []
for link in soup.find_all('a'):
    link_url = base_url + link['href']
    all_link_url.append(link_url)
    print(link_url)
all_link_url

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus


['http://olympus.realpython.org/profiles/aphrodite',
 'http://olympus.realpython.org/profiles/poseidon',
 'http://olympus.realpython.org/profiles/dionysus']
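Concatenating `base_url + link['href']` works here because every href on the page starts with a slash. As a more general sketch, `urllib.parse.urljoin()` from the standard library resolves an href against the page URL and also copes with hrefs that are already absolute. The href values below are inferred from the output above, and the last one is a made-up example.

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles"

# Hrefs as they presumably appear in the page, plus one hypothetical absolute URL.
hrefs = [
    "/profiles/aphrodite",
    "/profiles/poseidon",
    "/profiles/dionysus",
    "https://example.com/elsewhere",
]

absolute_urls = [urljoin(page_url, href) for href in hrefs]
for link_url in absolute_urls:
    print(link_url)
```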

## BeautifulSoup Continued

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. 

In [46]:
# tag
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag

<b class="boldest">Extremely bold</b>

In [48]:
# name
tag.name
# you can change a tag’s name
tag.name = 'quotation'
tag

<quotation class="boldest">Extremely bold</quotation>

In [49]:
# A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. 
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']

'boldest'

In [50]:
# see all attributes
tag.attrs

{'id': 'boldest'}

In [52]:
# change/add attributes
tag['id'] = 'verybold'
tag['another_attribute'] = 1
tag

<b another_attribute="1" id="verybold">bold</b>

In [54]:
# delete an attribute
del tag['id']
del tag['another_attribute']
tag

<b>bold</b>

In [56]:
# looking up a deleted attribute with tag['id'] would raise a KeyError
# tag['id']

In [58]:
# using .get() avoids the error when the attribute doesn't exist; it returns None instead
print(tag.get('id'))

None
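The three ways of asking for an attribute that may be missing can be compared in one short sketch. The sample tag and the fallback string `"no-id"` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b class="boldest">bold</b>', "html.parser").b

has_id = tag.has_attr("id")        # explicit membership test
default = tag.get("id")            # None when the attribute is missing
fallback = tag.get("id", "no-id")  # caller-supplied default instead of None

try:
    tag["id"]                      # dictionary-style access raises KeyError
    raised = False
except KeyError:
    raised = True

print(has_id, default, fallback, raised)
```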


In [59]:
# Multi-valued attributes
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

['body']

In [60]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [61]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# the value is not split into ['my', 'id']

'my id'

In [63]:
# When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

['index']
<p>Back to the <a rel="index contents">homepage</a></p>


In [64]:
# If you parse a document as XML, there are no multi-valued attributes
# (the 'xml' parser requires the lxml package to be installed):
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'

In [65]:
# searching the tree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

In [67]:
# find all <b> tags and take the first one
soup.find_all('b')[0]

<b>The Dormouse's story</b>

In [68]:
# find all tags whose names start with 'b'
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body
b


In [69]:
# find all tags whose names contain 't'
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
title


In [70]:
# pass a list to find_all() and it returns every tag
# that matches any item in the list
soup.find_all(["a", "b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [71]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [78]:
soup.find_all("p")
# soup.find_all('p', 'title')
# soup.find_all('p', class_='title')

# you do not have to include the tag name in find_all()
# you can filter by tag name, by any attribute, or by both
soup.find_all(id='link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [80]:
# find() and find_all() can search a specific part of a tag, such as its text (string=) or an attribute (href=)
soup.find(string=re.compile("sisters"))
soup.find_all(href=re.compile("elsie"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [81]:
# find all tags that have an id attribute
soup.find_all(id=True)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
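The filters shown in this section can be combined in a single `find_all()` call. The sketch below uses a shortened copy of the sample document and three combinations: tag name plus CSS class, tag name plus class plus a regular expression on `id`, and the `limit` argument to stop after a fixed number of results. The variable names are chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the sample document used in this section.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ has a trailing underscore because `class` is a Python keyword.
title_texts = [p.get_text() for p in soup.find_all("p", class_="title")]

# Several filters at once: every one of them must match.
late_ids = [a["id"] for a in soup.find_all("a", class_="sister", id=re.compile("link[23]"))]

# limit stops the search after the given number of results.
first_two = [a.get_text() for a in soup.find_all("a", limit=2)]

print(title_texts, late_ids, first_two)
```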

## Practice:
* send a request to bilibili
* parse the response with the HTML parser
* find all div tags with class_='item' and no style attribute (style=False)
* print the link in href if it exists
* save the links into a list

In [85]:
# https://www.bilibili.com/

import requests
from bs4 import BeautifulSoup

link = 'https://www.bilibili.com/'

r = requests.get(link)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, "html.parser")
# print(soup.prettify())

# filter by the tag name and attributes of your target section
tags = soup.find_all('div', class_='item', style=False)

In [91]:
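# collect each item's link; tag.a is None when the div has no <a> tag,
# and .get() returns None when the <a> tag has no href
links = []
for tag in tags:
    anchor = tag.a
    href = anchor.get('href') if anchor is not None else None
    if href is not None and href != 'javascript:;':
        links.append(str(href))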

In [92]:
links

['//www.bilibili.com/v/douga/',
 '//www.bilibili.com/anime/',
 '//www.bilibili.com/v/music/',
 '//www.bilibili.com/guochuang/',
 '//www.bilibili.com/v/dance/',
 '//www.bilibili.com/v/game/',
 '//www.bilibili.com/v/knowledge/',
 '//www.bilibili.com/v/tech/',
 '//www.bilibili.com/v/life/',
 '//www.bilibili.com/v/kichiku/',
 '//www.bilibili.com/v/fashion/',
 '//www.bilibili.com/v/information/',
 '//www.bilibili.com/v/ent/',
 '//www.bilibili.com/v/cinephile/',
 '//www.bilibili.com/cinema/',
 '//www.bilibili.com/read/home',
 '//live.bilibili.com',
 '//www.bilibili.com/blackboard/activity-list.html',
 '//www.bilibili.com/cheese/',
 'https://www.bilibili.com/blackboard/activity-5zJxM3spoS.html',
 '//www.bilibili.com/v/musicplus/']
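Most of the collected links start with `//`, which means they are protocol-relative and inherit the scheme of the page they came from. A minimal sketch of turning them into full URLs with `urllib.parse.urljoin()` follows; it uses three of the links from the output above rather than a live request.

```python
from urllib.parse import urljoin

page_url = "https://www.bilibili.com/"

# A few of the links collected above.
raw_links = [
    "//www.bilibili.com/v/douga/",
    "//live.bilibili.com",
    "https://www.bilibili.com/blackboard/activity-5zJxM3spoS.html",
]

# urljoin fills in the missing scheme and leaves absolute URLs unchanged.
full_links = [urljoin(page_url, link) for link in raw_links]
for link in full_links:
    print(link)
```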