# Web Scraping - Part 2

## Content
1. HTML language brief
2. Get webpages
3. Get contents in HTML using Regular Expression
4. Get contents in HTML using BeautifulSoup

# 1. HTML Language Brief

HTML (the Hypertext Markup Language) and CSS (Cascading Style Sheets) are two of the core technologies for building web pages. The third is JavaScript.

HTML provides the structure and content of a webpage, such as text, links, and layout. It is a nested block structure: elements are placed inside other elements.

CSS is responsible for the design of the webpage – how everything looks, for example, colors and where elements are on the page.

JavaScript is responsible for interactivity on a webpage which helps engage a user. You can implement various algorithms through JavaScript. 

HTML is a markup language for creating and structuring web content. It has a variety of tags and attributes for defining the layout and structure of a web document. An HTML document has the extension .htm or .html. You can edit HTML code in any basic text editor, even Notepad, and open the result in any browser. The browser interprets the tags and presents the content, with or without applied formatting.

## Understand HTML Tags
    <!DOCTYPE html>  
    <html>  
        <head>
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
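
The nested structure can also be made visible from Python. The following is a minimal sketch using the standard library's `html.parser` module; the class name `OutlinePrinter` and the variable `SAMPLE_HTML` are hypothetical names chosen for this illustration, not part of the original notes.

```python
from html.parser import HTMLParser

# Same structure as the example above, with the closing </body> tag spelled out.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Hypothetical helper: records each opening tag with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlinePrinter()
parser.feed(SAMPLE_HTML)
parser.close()

# Print one tag per line, indented by its depth in the document.
for depth, tag in parser.outline:
    print("    " * depth + "<" + tag + ">")
```

The indentation of the printed outline mirrors how `head` and `body` sit inside `html`, and how `h1` and `p` sit inside `body`.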

# 2. Get HTML

### Example 1

_urllib_ is a package that collects several modules for working with URLs, such as _urllib.request_ (for opening and reading URLs) and _urllib.parse_ (for parsing URLs).

_urllib.request_ provides the function _urlopen()_. It returns an object that works as a context manager and has methods such as geturl(), which returns the URL of the resource retrieved and is commonly used to determine whether a redirect was followed; info(), which returns the meta-information of the page, such as headers, in the form of an email.message_from_string() instance; and getcode(), which returns the HTTP status code of the response.

https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen
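
As a rough illustration of the methods described above, the sketch below opens a response inside a `with` block and calls each of them. To keep it runnable without network access, it assumes a `data:` URL as a stand-in for a real page (`urlopen` supports that scheme too); the `page` content is made up for this example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real page so the sketch runs offline.
# Replace `url` with an http(s) address to try it against a live site.
page = "<html><head><title>Demo</title></head><body><p>Hello World</p></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

# The with-statement closes the response even if reading fails.
with urlopen(url) as response:
    final_url = response.geturl()   # URL actually retrieved (after any redirects)
    headers = response.info()       # header object with mapping-style access
    status = response.getcode()     # HTTP status code; None for non-HTTP schemes
    body = response.read()          # raw bytes of the page

print(headers.get_content_type())
print(status)
print(body[:20])
```

With a real `http://` or `https://` URL, `status` would hold a code such as 200, and `final_url` may differ from `url` when the server redirects.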

In [28]:
from urllib.request import urlopen  
htmlfile = urlopen("http://google.com") 
htmltext = htmlfile.read()
#print (htmltext)

In [29]:
# If you run into a decoding problem, try the following code. 
htmlfile = urlopen("http://google.com") 
htmltext = htmlfile.read()
text = htmltext.decode(encoding="utf8", errors='ignore')
#print (text)
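
To see what `errors='ignore'` changes, here is a small sketch on a made-up byte string (an assumption for illustration, not data from any of the sites above). `read()` returns bytes, and decoding turns them into a `str`.

```python
raw = b"caf\xc3\xa9 \xe9clair"   # valid UTF-8 "café", then a stray Latin-1 byte

print(type(raw))                  # bytes, the same type read() returns

try:
    raw.decode("utf8")            # strict decoding fails on the stray byte
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

ignored = raw.decode("utf8", errors="ignore")    # drops the undecodable byte
replaced = raw.decode("utf8", errors="replace")  # marks it with U+FFFD instead

print(ignored)
print(replaced)
```

Note that `errors='ignore'` silently loses characters, so `errors='replace'` can be a safer choice when you want to notice damaged text.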

### Example 2: Get HTMLs of Multiple Webpages

Get the first 500 bytes of the HTML source code of three websites. 

In [37]:
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.csueastbay.edu"]

for x in urls:
    htmlfile = urlopen(x) 
    htmltext = htmlfile.read() 
    print (x)
    print (htmltext[: 500]) # print the first 500 bytes of each page (read() returns bytes)

http://google.com
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>'
http://nytimes.com
b'<!DOCTYPE html>\n<html lang="en-US"  xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <meta charset="utf-8" />\n    <title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>\n    <meta data-rh="true" name="description" content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. a
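
A loop like the one above stops at the first URL that cannot be opened. One possible way to keep going is to catch the error per URL, as in this sketch. It assumes a `data:` URL as a stand-in for a reachable page and a made-up `nosuchscheme` URL as a stand-in for a failing one, so that it runs without network access.

```python
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

# Assumptions: the data: URL stands in for a reachable page, and the made-up
# "nosuchscheme" URL stands in for one that cannot be opened.
urls = [
    "data:text/html;charset=utf-8," + quote("<html><title>OK</title></html>"),
    "nosuchscheme://example.invalid/",
]

previews = {}   # url -> first 500 bytes of the page
failures = {}   # url -> error description

for url in urls:
    try:
        with urlopen(url, timeout=10) as response:
            previews[url] = response.read()[:500]
    except (URLError, ValueError) as err:
        # URLError also covers HTTPError (for example 404 or 403 responses).
        failures[url] = str(err)

print(len(previews), "page(s) fetched;", len(failures), "failed")
```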

# 3. Get Contents in HTML using Regular Expression

_re_ is Python's regular expression module. It provides regular expression matching operations similar to those found in _Perl_. 

https://docs.python.org/3/library/re.html
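
The patterns in this section use `(.+?)` rather than `(.+)`. The short sketch below, on a made-up string, shows why: `+` alone is greedy and runs to the last closing tag, while `+?` stops at the first one.

```python
import re

text = "<span>Menu</span><span>Scores</span>"

greedy = re.findall("<span>(.+)</span>", text)   # matches as much as possible
lazy = re.findall("<span>(.+?)</span>", text)    # matches as little as possible

print(greedy)  # ['Menu</span><span>Scores']
print(lazy)    # ['Menu', 'Scores']
```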

In [36]:
from urllib.request import urlopen
from re import findall

#Read the webpage:
response = urlopen("https://www.espn.com/")
html = response.read()
text = html.decode()
#print(text)


# HTML has many tags; one of them is "span".
# Use findall to extract the text between <span> and </span>.
dataCrop = findall("<span>(.+?)</span>", text) 
print("The data cropped out of the webpage is:", dataCrop,"\n")
# Here the only match is 'Menu'.

print("Tip: to understand what a tag holds, extract the text between the tags, \
then copy that text and search for it on the actual website.")

The data cropped out of the webpage is: ['Menu'] 

Tip: to understand what a tag holds, extract the text between the tags, then copy that text and search for it on the actual website.


In [33]:
# If you run into a decoding problem, try the following code. 
htmlfile = urlopen("https://www.espn.com/") # open the web page; htmlfile is a file-like response object
htmltext = htmlfile.read()
text = htmltext.decode(encoding="utf8", errors='ignore')
#print (text)

dataCrop = findall("<span>(.+?)</span>", text)
print("The data cropped out of the webpage is:", dataCrop)




The data cropped out of the webpage is: ['Menu']


### Example: How to get titles of the following three websites?

In [7]:
from urllib.request import urlopen
import re  # regular expressions

urls = ["http://google.com", "http://nytimes.com", "http://www.csueastbay.edu"]

regex = '<title[^>]*>(.+?)</title>'  # capture the text inside <title>, allowing attributes on the tag

pattern = re.compile(regex)  # compile the pattern into a regular expression object

for url in urls:
    htmlfile = urlopen(url) 
    htmltext = htmlfile.read()
    text = htmltext.decode(encoding="utf8", errors='ignore')
    title = re.findall(pattern, text)  # find every match of the pattern in the decoded text
    print (title)

['Google']
['The New York Times - Breaking News, US News, World News and Videos']
['California State University, East Bay']
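
A `<title>` tag can carry attributes (for example `<title data-rh="true">`) or span several lines, and a pattern that expects a bare `<title>` on one line will then capture too much or nothing at all. The sketch below runs on made-up sample strings rather than the live pages and shows one way to tolerate both cases; the flags and the whitespace cleanup are choices made for this illustration.

```python
import re

# Made-up samples: a plain title, a title with an attribute, a multi-line title.
samples = [
    "<title>Google</title>",
    '<title data-rh="true">The New York Times</title>',
    "<title>\n   Two-line\n   title\n</title>",
]

# [^>]* skips any attributes; re.DOTALL lets "." cross line breaks;
# re.IGNORECASE also accepts <TITLE>.
title_pattern = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)

titles = []
for html in samples:
    match = title_pattern.search(html)
    # Collapse runs of whitespace so multi-line titles print on one line.
    titles.append(" ".join(match.group(1).split()) if match else None)

print(titles)
```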


# 4. Get Contents in HTML using BeautifulSoup

soup.findAll('p') finds all the `<p>` (paragraph) elements in an HTML file using BeautifulSoup in Python. Calling get_text() on one of the results returns only its visible text, with the nested tags removed.

For example,

    <p> Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>

should return:

    Many hundreds of named mango cultivars exist.
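
The mango example can be tried directly on a string, without downloading anything. This is a minimal sketch assuming the `bs4` package is installed; `find_all` is the current spelling of `findAll`, and both return the same results.

```python
from bs4 import BeautifulSoup

html = ('<p> Many hundreds of named mango '
        '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")     # list of every <p> element
first = paragraphs[0]

print(first)                         # the whole <p> element, tags included
print(first.get_text().strip())      # visible text only
print(first.a["href"])               # attribute of the nested <a> tag
```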

In [8]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS 

In [9]:
url = "https://www.google.com/"
htmlfile = urlopen(url) 
soup = BS(htmlfile,'html.parser') 


# prettify() returns the parsed HTML as a string, indented to show the nested structure
print (soup.prettify())

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
 <head>
  <meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"/>
  <meta content="noodp" name="robots"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script nonce="9p5FOu6pgWa/0ktUOjYvzA==">
   (function(){window.google={kEI:'RwUDYuLjKPDJ0PEP-aiN4AY',kEXPI:'0,1302536,56873,6059,206,4804,2316,383,246,5,1354,4013,923,315,1122515,1197732,669,380090,16114,17444,11240,17572,4859,1361,284,9006,3023,2821,1930,12835,4020,978,13227,3848,4192,6430,7432,15309,910,4171,1593,1279,2742,149,1103,840,1983,213,4101,3514,606,2023,1777,520,14670,3229,2843,7,4773,38,12639,11625,2771,1924,908,2,941,2614,12710,474,34,273,1244,1,544

In [10]:
url = "http://www20.csueastbay.edu/news/2015/10/10232015.html"
htmlfile = urlopen(url)
htmltext = htmlfile.read()
#print (htmltext)
soup = BS(htmltext,'html.parser')
print (soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5.0" name="viewport"/>
  <title>
   CSUEB Ranks No. 47 in Social Mobility Rankings
  </title>
  <!--BEGIN: GLOBAL-SCRIPTS-HEAD-->
  <link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap-accessibility.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/styles.css?v=36" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/styles2.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/flexslider.css?v=3" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_glob

In [48]:
print (soup.title)
print (soup.title.string)
print (soup.title.contents)

print(soup.p.contents)

<title>Cal State East Bay’s Top 5 Stories of the Year</title>
Cal State East Bay’s Top 5 Stories of the Year
['Cal State East Bay’s Top 5 Stories of the Year']
[<span style="font-weight: 400;">The year 2021 was one of transformation and change for Cal State East Bay and its three campuses. </span>]


In [12]:
print (soup.title.get_text()) #returns the text part of an entire document or a tag

CSUEB Ranks No. 47 in Social Mobility Rankings


In [13]:
print (soup.p) # soup.p returns the first <p> tag in the document

<p></p>


In [44]:
soup.findAll('p')

[<p><span style="font-weight: 400;">The year 2021 was one of transformation and change for Cal State East Bay and its three campuses. </span></p>,
 <p><span style="font-weight: 400;">We began the year with a new leader — President Cathy Sandeen — and welcomed several other new faces in leadership positions as 2021 progressed. </span></p>,
 <p><span style="font-weight: 400;">After most of our faculty, staff and students spent much of 2020 learning, teaching and working from home, we welcomed the opportunity to return in person again as the COVID-19 vaccines rolled out and our campuses opened back up for Fall Semester. </span></p>,
 <p><span style="font-weight: 400;">Here are the top five stories of the year from the university’s </span><a href="https://www.csueastbay.edu/news-center/index.html"><span style="font-weight: 400;">news center</span></a><span style="font-weight: 400;"> and </span><a href="https://www.ebtoday.com/"><span style="font-weight: 400;">magazine</span></a><span style

## <font color='red'>**Exercise:**</font>

What is the difference between the following two pieces of code? 

    soup.findAll('p')
    
    for tag in soup.findAll('p'):
        print (tag.contents)

In [22]:
for tag in soup.findAll('p'):
    print (tag.contents)

[<span style="font-weight: 400;">The year 2021 was one of transformation and change for Cal State East Bay and its three campuses. </span>]
[<span style="font-weight: 400;">We began the year with a new leader — President Cathy Sandeen — and welcomed several other new faces in leadership positions as 2021 progressed. </span>]
[<span style="font-weight: 400;">After most of our faculty, staff and students spent much of 2020 learning, teaching and working from home, we welcomed the opportunity to return in person again as the COVID-19 vaccines rolled out and our campuses opened back up for Fall Semester. </span>]
[<span style="font-weight: 400;">Here are the top five stories of the year from the university’s </span>, <a href="https://www.csueastbay.edu/news-center/index.html"><span style="font-weight: 400;">news center</span></a>, <span style="font-weight: 400;"> and </span>, <a href="https://www.ebtoday.com/"><span style="font-weight: 400;">magazine</span></a>, <span style="font-weight: 4

Answer: soup.findAll('p') returns a list containing all the `<p>` tags; when the notebook displays that list, each tag is shown on its own line. In the for loop, tag.contents returns a list containing the children (nested tags and strings) of the current tag. 
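
The difference between a tag, its `.contents`, its `.string`, and `get_text()` is easier to see on a small fragment. The sketch below uses made-up markup modeled on the paragraphs shown earlier and assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

html = ('<p><span>Top stories from the </span>'
        '<a href="https://www.csueastbay.edu/news-center/index.html">news center</a></p>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("p")

children = tag.contents     # list of direct children: a <span> tag and an <a> tag
only_string = tag.string    # None, because <p> has more than one child
text = tag.get_text()       # all nested text joined into one str

print(len(children), only_string, repr(text))
```

`.contents` keeps the markup of the children, while `get_text()` is usually what you want when only the readable text matters.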

### Example: Code scraper

Get the content inside the tags `<span class="footer-link">`...`</span>`. 

In [23]:
content_list=soup.findAll('span',attrs={'class':"footer-link"})
content_list
# Compare the notebook's display of content_list above with the output of:
# print (content_list)

[<span class="footer-link">Additional Resources</span>,
 <span class="footer-link">Campus</span>,
 <span class="footer-link">Legal</span>,
 <span class="footer-link">Tools</span>]

In [24]:
soup.findAll('span',attrs={'class':"footer-link"})

[<span class="footer-link">Additional Resources</span>,
 <span class="footer-link">Campus</span>,
 <span class="footer-link">Legal</span>,
 <span class="footer-link">Tools</span>]

In [25]:
for tag in content_list:
    print (tag.contents)
    
# Also try print(tag.get_text()) and print(tag),
# and compare the differences.

['Additional Resources']
['Campus']
['Legal']
['Tools']


### Example: Code scraper

The same scraper can also be written with the *re* library.

In [49]:
url = "http://www20.csueastbay.edu/news/2015/10/10232015.html"
htmlfile = urlopen(url)
htmltext = htmlfile.read()
#print (htmltext) #For testing
text = htmltext.decode()
regex = '<span class="footer-link">(.+?)</span>'
pattern = re.compile(regex)
#print (pattern) # For testing
X = re.findall(pattern, text)
print (X)

['Additional Resources', 'Campus', 'Legal', 'Tools']
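
Both approaches give the same result on this page, but a literal regular expression depends on the exact spelling of the opening tag. The sketch below uses made-up footer markup (an assumption for illustration) in which one span carries an extra attribute and a second class, and compares the two approaches; it assumes `bs4` is installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up footer markup: the second span has an extra attribute and class.
html = ('<span class="footer-link">Campus</span>'
        '<span id="legal" class="footer-link small">Legal</span>')

# The literal pattern only matches tags written exactly this way.
by_regex = re.findall('<span class="footer-link">(.+?)</span>', html)

# BeautifulSoup matches any <span> whose class list includes "footer-link".
soup = BeautifulSoup(html, "html.parser")
by_soup = [tag.get_text() for tag in soup.find_all("span", class_="footer-link")]

print(by_regex)  # ['Campus']
print(by_soup)   # ['Campus', 'Legal']
```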


## <font color='red'>Exercise: getting article</font>

Given this news article, https://www.ebtoday.com/stories/why-zombies-have-taken-over, can you get the title and paragraphs of the article in plain text? Note that all output must be plain text, without any HTML tags or newline characters. 

## <font color='red'>Answer for the Exercise: getting article</font>

In [50]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url = "https://www.csueastbay.edu/news-center/2021/12/cal-state-east-bays-top-5-stories-of-the-year.html"
#query the website and store the response in the variable htmlfile. 
htmlfile=urlopen(url)
#parse the html using BeautifulSoup and store it in variable 'soup'
soup = BS(htmlfile,'html.parser')


print(soup.title.string.strip())

for tag in soup.findAll('p'): 
    print (tag.get_text())

#Please try following:
#for tag in soup.findAll('p'): 
#    print (tag.contents)

Cal State East Bay’s Top 5 Stories of the Year
The year 2021 was one of transformation and change for Cal State East Bay and its three campuses. 
We began the year with a new leader — President Cathy Sandeen — and welcomed several other new faces in leadership positions as 2021 progressed. 
After most of our faculty, staff and students spent much of 2020 learning, teaching and working from home, we welcomed the opportunity to return in person again as the COVID-19 vaccines rolled out and our campuses opened back up for Fall Semester. 
Here are the top five stories of the year from the university’s news center and magazine publications:
Share
25800 Carlos Bee Boulevard  |  Hayward, CA 94542  |  510-885-3000
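
The exercise also asks for output without newline characters. One possible way to do that is to collapse all whitespace after calling get_text(), as in this sketch. It runs on a made-up HTML fragment instead of a downloaded page, the helper name `plain` is hypothetical, and `bs4` is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for a downloaded article page.
html = """
<html><head><title>
   Sample Article
</title></head>
<body>
  <p><span>First paragraph,
     split across lines. </span></p>
  <p></p>
  <p>Second <a href="#">linked</a> paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def plain(tag):
    """Return the tag's text with newlines and repeated spaces collapsed."""
    return " ".join(tag.get_text().split())


title = plain(soup.title)
paragraphs = [plain(p) for p in soup.find_all("p")]
paragraphs = [p for p in paragraphs if p]   # skip empty <p></p> tags

print(title)
for p in paragraphs:
    print(p)
```

On a real page, footer paragraphs such as share links or the postal address would still be included; filtering those out depends on the structure of the specific site.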
