Module 3: web access

In [1]:
# importing modules
import urllib.request as ur
import requests
from bs4 import BeautifulSoup
import re
import time

1. Example of using urllib to access a web page

In [2]:
page = ur.urlopen("https://www.google.com/")
print(type(page))  # http.client.HTTPResponse object
print(type(page.read()))  # read() returns the entire body as bytes (and consumes it)
print("\nHeader content:")
print(page.getheader("Content-Type"))  # media type and charset, e.g. text/html; charset=ISO-8859-1
print("\nDictionary of header:")
for key, value in page.getheaders():  # getheaders() returns a list of (name, value) tuples
    print(key + ":", value)
print("\nAnother way to get header info:")
print(page.info())  # info() returns the same headers as a single message object

<class 'http.client.HTTPResponse'>
<class 'bytes'>

Header content:
text/html; charset=ISO-8859-1

Dictionary of header:
Date: Tue, 16 May 2023 19:28:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Content-Security-Policy-Report-Only: object-src 'none';base-uri 'self';script-src 'nonce-RqhNwJXl74AFNSxUr_8RlQ' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2023-05-16-19; expires=Thu, 15-Jun-2023 19:28:28 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AUEFqZfQx9UOJHuANWB5Ecqg5Z68d8kNKoTK9Mm3BnLksV0GH7SFFgvKjA; expires=Sun, 12-Nov-2023 19:28:28 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=rJtFfFeIzO61d4NDbw8ntJ1nqGyb4LBYA3emewKEF36a9nTQnSunBgOmrSzGGFKpGetjdCZNY4JIoZ_
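
Since read() returns bytes, the body has to be decoded before it can be used as text, and the Content-Type header names the charset to use. The sketch below shows one way to do that with the standard library only. The helper names charset_from_content_type and decode_body are made up for this illustration, and the demo runs on a hard-coded byte string rather than a live response.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lowercase, or None if absent
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode response bytes using the charset named in the Content-Type header."""
    return raw.decode(charset_from_content_type(content_type), errors="replace")


# Offline demo (hypothetical data): byte 0xE9 is an accented 'e' in ISO-8859-1
header = "text/html; charset=ISO-8859-1"
text = decode_body(b"caf\xe9", header)
print(charset_from_content_type(header), len(text))
```

With a live urllib response, page.headers.get_content_charset() should give the same charset directly, because the headers object is built on the same email message class.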

2. Example of using requests to access a web page

In [3]:
page = requests.get("https://python.org/")
print(page.status_code)  # status code = 200 means success
print(type(page))
text = page.text  # get text from the Response object
print(type(text))
content = page.content
print(type(content))
print(page.headers["Content-Type"])  # type of data
print(page.encoding)

200
<class 'requests.models.Response'>
<class 'str'>
<class 'bytes'>
text/html; charset=utf-8
utf-8
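
A status code of 200 is only one of many possible results. As a rough illustration, the sketch below labels a status code with its standard phrase and its general class, using only the standard library. The function name describe_status is an assumption made for this example and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK (success)' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In requests itself, page.ok and page.raise_for_status() cover the common case of checking that a request did not fail.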


3. Use BeautifulSoup to get data from a web page. Add to the code below to print the first 1000 characters of the page.

In [4]:
# Use BeautifulSoup to get data from a web page.
# Add to the code below to print the first 1000 characters of the page.
page = requests.get("https://python.org/")
soup = BeautifulSoup(page.content, "lxml")
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <!-- Google tag (gtag.js) -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'G-TF35YF9CVH');
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
  <link href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js" rel="prefetch"/>
  <meta content="Python.org" name="application-name"/>
  <meta con

4. Find all tag names with 'b' in the name (use regex) and print just the tag name

In [5]:
for tag in soup.find_all("b"):  # tag is exactly 'b'
    print(tag)
    print(tag.name)
    print(tag.text)

print()

for tag in soup.find_all(re.compile("b")):  # tag contains 'b' somewhere
    print(tag.name)

<b>Web Development</b>
b
Web Development
<b>GUI Development</b>
b
GUI Development
<b>Scientific and Numeric</b>
b
Scientific and Numeric
<b>Software Development</b>
b
Software Development
<b>System Administration</b>
b
System Administration

body
label
button
blockquote
table
tbody
b
b
b
b
b
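
When find_all() is given a compiled regex, BeautifulSoup tests each tag name with the pattern's search() method, which is why 'label' and 'table' appear above. The sketch below applies the same kind of test to a plain list of the tag names from the output, with no web page involved, to show how anchoring the pattern changes the result.

```python
import re

# Tag names taken from the output above (duplicates removed)
tag_names = ["body", "label", "button", "blockquote", "table", "tbody", "b"]

contains_b = [name for name in tag_names if re.compile("b").search(name)]
starts_with_b = [name for name in tag_names if re.compile("^b").search(name)]
exactly_b = [name for name in tag_names if re.compile("^b$").search(name)]

print("contains 'b':   ", contains_b)
print("starts with 'b':", starts_with_b)
print("exactly 'b':    ", exactly_b)
```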


5. Find all tag names that are 'image' or 'table' and print the entire tag element

In [6]:
for tag in soup.find_all(["image", "table"]):  # 'image' is not an HTML tag (images use 'img'), so only 'table' matches
    print(tag)

<table border="0" cellpadding="0" cellspacing="0" class="quote-from" width="100%">
<tbody>
<tr>
<td><p><a href="/success-stories/saving-the-world-with-open-data-and-python/">Saving the world with Open Data and Python</a> <em>by James Baster</em></p></td>
</tr>
</tbody>
</table>


6. Find and print all https links

In [7]:
for link in soup.find_all("a", href=True):  # look for 'a' tags that have an 'href' attribute
    if link["href"].startswith("https"):  # keep only links that start with 'https'
        # note: checking link['href'][0] == 'h' would also match plain 'http' links
        print(link["href"])

https://www.python.org/psf/
https://docs.python.org
https://pypi.org/
https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
https://www.facebook.com/pythonlang?fref=ts
https://twitter.com/ThePSF
https://docs.python.org/3/license.html
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://docs.python.org
https://blog.python.org
https://pyfound.blogspot.com/2023/05/psf-board-election-dates-for-2023.html
https://pyfound.blogspot.com/2023/05/python-humble-bundle.html
https://pyfound.blogspot.com/2023/04/thank-you-for-many-years-of-se
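
A stricter way to test for https is to parse each href and compare its scheme, which also makes it possible to include relative links once they are resolved against the page URL. The sketch below is one possible approach using urllib.parse. The function name https_links and the sample href list are assumptions made for this illustration.

```python
from urllib.parse import urljoin, urlparse


def https_links(hrefs, base_url):
    """Resolve each href against base_url and keep only those whose scheme is https."""
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative links become absolute
        if urlparse(absolute).scheme == "https":
            result.append(absolute)
    return result


# Illustrative sample: two absolute links and one relative link
sample_hrefs = ["https://docs.python.org", "http://planetpython.org/", "/about/"]
links = https_links(sample_hrefs, "https://www.python.org/")
for link in links:
    print(link)
```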

7. Print all department names that are listed at the web page: http://deanza.edu/buscs/

In [8]:
## MY VERSION

# Print all department names that are listed at the web page: http://deanza.edu/buscs/
# departments are a bulleted list following the header, "Department Websites"

# Get the page and create a BeautifulSoup object with the page's content
page_DA = requests.get("http://deanza.edu/buscs/")
soup_DA = BeautifulSoup(page_DA.content, "lxml")

# Find the header with the text "Department Websites"
dept_header = soup_DA.find("h3", string="Department Websites")

# print the header
print(dept_header.text, ":")

# Find the unordered list that follows the header
dept_list = dept_header.find_next_sibling("ul")

# Find all the list items in the unordered list
dept_list_items = dept_list.find_all("li")

# Print the text of each list item
for item in dept_list_items:
    print("-", item.text)

Department Websites :
- Accounting
- Automotive Technology
- Business
- Computer Information Systems
- Design and Manufacturing Technologies (DMT) (includes Computer Aided Design and Digital Imaging; Manufacturing and CNC Technology)
                                    
- Real Estate


In [9]:
## IN CLASS VERSION
# Print all department names that are listed at the web page: http://deanza.edu/buscs/
# departments are a bulleted list following the header, "Department Websites"

# get the page and create a BeautifulSoup object
page = requests.get("http://deanza.edu/buscs/")
soup = BeautifulSoup(page.content, "lxml")

# the department list is inside the main content div; print the text of each link in its list items
div = soup.find("div", class_="col-xs-12 col-md-8 l-content")
for element in div.find_all("li"):
    for link in element.find_all("a"):
        print(link.text)


Accounting
Automotive Technology
Business
Computer Information Systems
Design and Manufacturing Technologies (DMT)
Real Estate


In class version #2

In [10]:
# Print all department names that are listed at the web page: http://deanza.edu/buscs/
# departments are a bulleted list following the header, "Department Websites"

# get the page and create a BeautifulSoup object
page = requests.get("http://deanza.edu/buscs/")
soup = BeautifulSoup(page.content, "lxml")

# use a CSS selector to find every link inside a list item of the main content div, and print its text
for link in soup.select("div.col-xs-12.col-md-8.l-content li a"):
    print(link.text)



Accounting
Automotive Technology
Business
Computer Information Systems
Design and Manufacturing Technologies (DMT)
Real Estate


8. Use the API of the ISS (International Space Station) to see where the ISS is currently located.
    a. google:  iss api
    b. go to the link (already coded below) and fetch the data
    c. print the data. What data type is it?
    d. print in nice format: time, latitude, longitude    (for time, use time.ctime(t)  to convert to a character time string)

In [13]:
# Provided code
page = requests.get("http://api.open-notify.org/iss-now.json")
data = page.json()

# Print the data type
print("Data type:", type(data), "\n")

# Extract information from the data
timestamp = data["timestamp"]
latitude = data["iss_position"]["latitude"]
longitude = data["iss_position"]["longitude"]

# Print the information in a nice format
print("Current time:", time.ctime(timestamp))
print("Latitude:", latitude)
print("Longitude:", longitude)

Data type: <class 'dict'> 

Current time: Tue May 16 12:29:35 2023
Latitude: 44.5554
Longitude: 86.1458


In [12]:
## IN CLASS VERSION

page = requests.get("http://api.open-notify.org/iss-now.json")
data = page.json()
print("Data:", data, "\n")
print("Data type:", type(data), "\n")



Data: {'iss_position': {'latitude': '46.4728', 'longitude': '81.1684'}, 'timestamp': 1684265311, 'message': 'success'} 

Data type: <class 'dict'> 
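
Two details of the response are easy to miss: the latitude and longitude arrive as strings, and time.ctime() formats the timestamp in the local time zone of the machine running the code. The sketch below works on a copy of the dictionary printed above, so it needs no network access, and converts the values to floats and the timestamp to UTC. The function name parse_iss is an assumption made for this example.

```python
from datetime import datetime, timezone

# Copy of the response printed above
sample = {
    "iss_position": {"latitude": "46.4728", "longitude": "81.1684"},
    "timestamp": 1684265311,
    "message": "success",
}


def parse_iss(data):
    """Return (UTC datetime, latitude, longitude) from an iss-now style dictionary."""
    if data.get("message") != "success":
        raise ValueError("API did not report success")
    position = data["iss_position"]
    when = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    # coordinates are strings in the response, so convert them to numbers
    return when, float(position["latitude"]), float(position["longitude"])


when, lat, lon = parse_iss(sample)
print(f"{when:%Y-%m-%d %H:%M:%S} UTC  lat={lat:.4f}  lon={lon:.4f}")
```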



To go further, install and use the geopy module (https://pypi.python.org/pypi/geopy),
which can look up the place that corresponds to a particular latitude and longitude.