# Web Scraping Lecture Part 1

## Content
1. HTML language brief
2. Get webpages
3. Get contents in HTML using Regular Expression
4. Get contents in HTML using BeautifulSoup

# 1. HTML Language Brief

HTML (the Hypertext Markup Language) and CSS (Cascading Style Sheets) are two of the core technologies for building web pages. The third is JavaScript.

HTML provides the structure and content of the webpage, such as text, links, and layout. Its elements are organized as a nested block structure.

CSS is responsible for the design of the webpage – how everything looks, for example, colors and where elements are on the page.

JavaScript is responsible for interactivity on a webpage, which helps engage the user. It is a full programming language, so you can implement various algorithms in it.

HTML is a markup language with a variety of tags and attributes for defining the layout and structure of a web document. An HTML document has the extension .htm or .html. You can edit HTML code in any basic text editor, even Notepad, and open the result in any browser. The browser renders the tags and presents the content you want to display, with or without applied formatting.

## Understand HTML Tags
    <!DOCTYPE html>  
    <html>  
        <head>
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
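
To make the nested block structure concrete, here is a minimal sketch that walks the sample page with Python's standard `html.parser` module and records one indented line per tag or text node. The class name `OutlinePrinter` and the variable `page` are made up for this illustration; they are not part of the lecture code.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Hypothetical helper: record one indented line per start tag or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level
        self.lines = []  # collected outline lines

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1  # everything until the end tag is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.lines.append("  " * self.depth + text)


page = """<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <h1>First Scraping</h1>
    <p>Hello World</p>
  </body>
</html>"""

printer = OutlinePrinter()
printer.feed(page)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of tag nesting: the heading and the paragraph sit inside body, which sits inside html.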

# 2. Get HTML

### Example1

_urllib_ is a package that collects several modules for working with URLs, such as _urllib.request_ (for opening and reading URLs) and _urllib.parse_ (for parsing URLs).

_urllib.request_ provides the function _urlopen()_. It returns a response object that can be used as a context manager and has methods such as geturl(), which returns the URL of the resource retrieved (commonly used to determine whether a redirect was followed); info(), which returns the meta-information of the page, such as headers, as the kind of message object produced by email.message_from_string(); and getcode(), which returns the HTTP status code of the response.

https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen
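
The methods described above can be tried out with a short sketch. To keep it runnable without a network connection, it assumes a `data:` URL (which _urlopen()_ also accepts) in place of a real website; with an http(s) URL the same calls apply, and getcode() then returns a status such as 200.

```python
from urllib.request import urlopen

# Assumption for illustration: a data: URL keeps the sketch offline.
# Replace it with an http(s) URL to talk to a real server.
url = "data:text/plain;charset=utf-8,Hello%20World"

with urlopen(url) as response:      # the response is closed when the block ends
    final_url = response.geturl()   # URL of the resource actually retrieved
    headers = response.info()       # header block as a message object
    content_type = headers.get_content_type()
    status = response.getcode()     # HTTP status code; not set for a data: URL
    body = response.read()          # raw bytes of the response body

print(final_url)
print(content_type)
print(status)
print(body.decode("utf-8"))
```

Using the response in a `with` statement is a design choice: it guarantees the underlying connection is released even if reading the body raises an error.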

In [3]:
from urllib.request import urlopen  # urlopen() opens a URL and returns a response object

htmlfile = urlopen("https://www.google.com/")  # open the web page; the response behaves like a file

htmltext = htmlfile.read()  # read the entire response body as bytes
print(htmltext)  # show the source code of the webpage (note the b'...' prefix)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="TQgQ1ePrdMcllZdhpJ5TNg">(function(){var _g={kEI:\'4Eq5Za-hH6X9kPIP-um1EA\',kEXPI:\'0,1365467,207,4804,896,1131174,1963,868574,327186,678,380090,44798,23792,12315,17584,4998,17075,38444,2872,2891,3926,214,4208,3406,606,30668,19390,10632,15324,2025,1,16916,2652,4,59617,2980,24067,6627,7596,1,42154,2,16395,342,23024,6700,31121,4568,6256,24673,30152,2912,2,2,1,26632,8155,23351,8702,13733,9779,42458,20199,36747,3801,2412,30219,2266,764,15816,1804,7734,6072,12026,575,687,8175,11813,

In [4]:
# read() returns bytes. If you encounter a decoding problem, you may try the
# following code: errors='ignore' skips bytes that are not valid UTF-8.
htmlfile = urlopen("http://google.com")  # open the web page
htmltext = htmlfile.read()
text = htmltext.decode(encoding="utf8", errors='ignore')
print(text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="e3AfoH54_ZGTfV13Soe10g">(function(){var _g={kEI:'4Eq5ZbbULOb9kPIPiaejgAI',kEXPI:'0,1365467,207,4804,1132070,1963,868573,327204,661,121,379968,35512,9287,23792,12313,17586,4998,50700,4819,2872,2891,3926,7828,606,30668,30022,15324,781,1244,1,16916,2652,4,42766,16851,2980,24028,6666,7596,1,11942,30212,2,16395,342,23024,6699,31123,4568,6255,24673,33064,2,2,1,10957,13669,2006,8155,8861,14490,8701,13734,9779,42459,20198,36747,3801,2412,25097,5122,3030,11151,4665,357,1447,7734,6626,1,114
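
The effect of the `errors` argument can be seen on a small, made-up byte string, with no web page needed. This sketch appends one invalid byte to valid UTF-8 data and compares strict decoding with `errors="ignore"` and `errors="replace"`.

```python
# Valid UTF-8 for "café" followed by one byte (0xff) that is never valid in UTF-8.
raw = "café".encode("utf-8") + b"\xff"

strict_error = None
try:
    raw.decode("utf-8")              # default is errors="strict"
except UnicodeDecodeError as exc:
    strict_error = str(exc)          # strict decoding refuses the bad byte

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD for it

print(strict_error)
print(ignored)
print(replaced)
```

`errors="ignore"` silently discards data, so it is best kept for quick inspection; `errors="replace"` at least leaves a visible marker where something was lost.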

### Example2: Get HTMLs of Multiple Webpages

Get the first 500 bytes of the HTML source code of three websites.

In [5]:
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.csueastbay.edu"]

for x in urls:
    htmlfile = urlopen(x)  # open the web page
    htmltext = htmlfile.read()  # entire response body as bytes
    print(x)
    print(htmltext[:500])  # print the first 500 bytes of each page

http://google.com
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>'
http://nytimes.com
b'<!DOCTYPE html>\n<html lang="en" class=" nytapp-vi-homepage"  xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <meta charset="utf-8" />\n    <title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>\n    <meta data-rh="true" name="description" content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscri
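
Because read() returns bytes, the slice `htmltext[:500]` counts bytes, not characters. The two differ as soon as the page contains non-ASCII text. A minimal sketch on a made-up byte string (not a real page):

```python
# Five copies of "é"; each one takes two bytes in UTF-8.
page = ("é" * 5).encode("utf-8")

byte_count = len(page)                 # number of bytes
char_count = len(page.decode("utf-8")) # number of characters

# Slicing the bytes can cut a character in half; errors="ignore" drops the fragment.
first_five_bytes = page[:5].decode("utf-8", errors="ignore")

# Decoding first and slicing afterwards counts whole characters.
first_five_chars = page.decode("utf-8")[:5]

print(byte_count, char_count)
print(first_five_bytes)
print(first_five_chars)
```

If you want exactly N characters of a page, decode the bytes first and slice the resulting string.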

# 3. Get Contents in HTML using Regular Expression

_re_ is Python's regular expression module. It provides regular expression matching operations similar to those found in _Perl_.

https://docs.python.org/3/library/re.html
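
Before applying _re_ to a live page, it may help to see how the capture group `(.+?)` behaves on a small, made-up string. This sketch contrasts greedy and lazy matching and shows the `re.DOTALL` flag, which lets `.` match newlines.

```python
import re

html = "<p>first</p><p>second</p>"

# Greedy: .+ grabs as much as possible, so it runs to the LAST </p>.
greedy = re.findall(r"<p>(.+)</p>", html)

# Lazy: .+? grabs as little as possible, so it stops at the FIRST </p>.
lazy = re.findall(r"<p>(.+?)</p>", html)

# By default "." does not match a newline, so a paragraph spanning lines is missed.
multiline = "<p>line one\nline two</p>"
without_flag = re.findall(r"<p>(.+?)</p>", multiline)
with_flag = re.findall(r"<p>(.+?)</p>", multiline, flags=re.DOTALL)

print(greedy)        # ['first</p><p>second']
print(lazy)          # ['first', 'second']
print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```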

In [6]:
from urllib.request import urlopen
from re import findall

# Read the webpage:
response = urlopen("https://www.espn.com/")
html = response.read()
text = html.decode()
#print(text)

# Use a regular expression to find the data we want: the text inside
# each "<p>...</p>" element. The lazy group (.+?) stops at the first
# closing tag, and "." does not match newlines by default, so only
# paragraphs that sit on a single line are found.

dataCrop = findall("<p>(.+?)</p>", text)  # get whatever is between <p> and </p>
print("The data cropped out of the webpage is:", dataCrop)

The data cropped out of the webpage is: ["During warmups, Travis Kelce throws Justin Tucker's gear to the side to make way for Patrick Mahomes, then both teams get physical near the sideline.", "Has Baker Mayfield earned a big deal? Could Tee Higgins and Chase Young find new homes? Here's our top-50 free agent ranking.", 'Our insiders makes the moves they think NBA GMs should be making in the next 10 days, including a big deal to shake up the Lakers.', "Is Xavi a bad manager? Have signings like Robert Lewandowski been mistakes? Let's look at the reasons for Barca's struggles, and figure out what's causing it.", "Harbaugh's last NFL journey ended in disappointment. The Chargers have a checkered coaching history. Can this marriage work?", 'Which side has the better starting five? Which selection was most surprising? What All-Star reserve battles should we expect? Our NBA insiders answer them all.', 'Jim Harbaugh is off to the NFL after an indelible nine-year run at Michigan. There were h

In [7]:
# If you encounter a decoding problem, you may try the following code.
htmlfile = urlopen("https://www.espn.com/")  # open the web page
htmltext = htmlfile.read()
text = htmltext.decode(encoding="utf8", errors='ignore')
#print (text)

dataCrop = findall("<span>(.+?)</span>", text)
print("The data cropped out of the webpage is:", dataCrop)

The data cropped out of the webpage is: ['Menu']


### Example: How to get titles of the following three websites?

In [8]:
from urllib.request import urlopen
import re  # regular expression module from the standard library

urls = ["http://google.com", "http://nytimes.com", "https://www.csueastbay.edu/students/index.html"]

regex = '<title>?(.+?)</title>'  # capture what follows "<title"; the ">" is optional here, so tag attributes can end up in the match

pattern = re.compile(regex)  # compile the regular expression pattern into a regular expression object

for url in urls:
    htmlfile = urlopen(url)
    htmltext = htmlfile.read()
    text = htmltext.decode(encoding="utf8", errors='ignore')
    title = re.findall(pattern, text)  # find every match of the pattern in the decoded text
    print(title)

['Google']
[' data-rh="true">The New York Times - Breaking News, US News, World News and Videos', ' id="styln-2024-election-hp-menu">2024 Election']
['Current Students | Cal State East Bay']
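
The New York Times result above includes attribute text such as ` data-rh="true">` because `>?` makes the closing bracket of the opening tag optional, so the capture group can start inside the tag. One possible alternative, sketched here on made-up sample strings rather than the live pages, is to skip any attributes explicitly with `[^>]*`. The names `loose` and `strict` are chosen for this illustration only.

```python
import re

samples = [
    "<title>Google</title>",
    '<title data-rh="true">The New York Times</title>',
]

# Pattern from the example above: the ">" is optional.
loose = re.compile(r"<title>?(.+?)</title>")

# Assumed alternative: \b ends the tag name, [^>]* skips attributes up to ">".
strict = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

loose_hits = [loose.findall(html) for html in samples]
strict_hits = [strict.findall(html) for html in samples]

print(loose_hits)
print(strict_hits)
```

The stricter pattern still returns every title element it finds, so a page with more than one will yield more than one match; taking only the first match is a reasonable choice when you want the page title.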


In [9]:
from urllib.request import urlopen
import regex  # third-party package (pip install regex); not the same module as re
dir(regex)


['A',
 'ASCII',
 'B',
 'BESTMATCH',
 'D',
 'DEBUG',
 'DEFAULT_VERSION',
 'DOTALL',
 'E',
 'ENHANCEMATCH',
 'F',
 'FULLCASE',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'P',
 'POSIX',
 'Pattern',
 'R',
 'REVERSE',
 'Regex',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'V0',
 'V1',
 'VERBOSE',
 'VERSION0',
 'VERSION1',
 'W',
 'WORD',
 'X',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_regex',
 '_regex_core',
 'cache_all',
 'compile',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'match',
 'purge',
 'regex',
 'search',
 'split',
 'splititer',
 'sub',
 'subf',
 'subfn',
 'subn',
 'template']

In [10]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']

# 4. Get Contents in HTML using BeautifulSoup

soup.findAll('p') finds all paragraph elements (the p tags) in an HTML document using BeautifulSoup. findAll is the older spelling of find_all.

For example, given the paragraph

    <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>

calling get_text() on the matched tag should return:
Many hundreds of named mango cultivars exist.
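
The mango example can be checked directly by parsing that one paragraph from a string instead of a downloaded page. This is a minimal sketch; it assumes the `bs4` package is installed, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
)

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")  # list of every <p> tag in the document
first = paragraphs[0]

print(first)             # the whole tag, markup included
print(first.get_text())  # Many hundreds of named mango cultivars exist.
print(first.a["href"])   # /wiki/Cultivar
```

get_text() joins the text of the tag and all of its descendants, which is why the link text appears in the result without the surrounding `<a>` markup.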

In [11]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS #BeautifulSoup is a Python library 
                                    #for pulling data out of HTML and XML files.

In [12]:
url = "https://www.google.com/"

htmlfile = urlopen(url)  # open the url; the response is a file-like object
soup = BS(htmlfile, 'html.parser')  # create a BeautifulSoup object from the response
# A BeautifulSoup object represents the parsed HTML/XML document.
# It is created by passing a string or a file-like object (an open handle to a
# file stored locally on our machine, or a web page response).

print(soup.prettify())
# print the complete HTML source code as an indented, nested structure

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
 <head>
  <meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"/>
  <meta content="noodp" name="robots"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script nonce="oti5B3NPZy1dX5yD1KamPg">
   (function(){var _g={kEI:'5Eq5ZbSUBuXakPIPrL2_iAE',kEXPI:'0,18167,1347301,206,4804,1132070,1962,868574,327242,616,302568,14097,63431,44799,23792,12312,17587,4998,17075,38444,2872,2891,3926,4422,3406,606,67205,8809,2025,1,16916,2652,4,59617,2980,24064,6630,7596,1,11943,30211,2,16395,342,23024,6700,31121,4569,6258,24670,33064,2,2,1,10956,15676,8155,23351,22435,9780,12414,30044,3141,17057,36747,3801,2412,25096,5123,3030,15816

In [13]:
url = "http://www20.csueastbay.edu/news/2015/10/10232015.html"
htmlfile = urlopen(url)
htmltext = htmlfile.read()
#print (htmltext)
soup = BS(htmltext,'html.parser')
print (soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5.0" name="viewport"/>
  <title>
   CSUEB Ranks No. 47 in Social Mobility Rankings
  </title>
  <!--BEGIN: GLOBAL-SCRIPTS-HEAD-->
  <link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap-accessibility.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/styles.css?v=36" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/styles2.css" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_global/css/flexslider.css?v=3" rel="stylesheet"/>
  <link href="https://www.csueastbay.edu/_glob

In [14]:
print (soup.title)
print (soup.title.string)
print (soup.title.contents)

print(soup.p.contents)  # contents of the first <p> tag (an empty list here)

<title>CSUEB Ranks No. 47 in Social Mobility Rankings</title>
CSUEB Ranks No. 47 in Social Mobility Rankings
['CSUEB Ranks No. 47 in Social Mobility Rankings']
[]


In [15]:
print (soup.title.get_text()) #returns the text part of an entire document or a tag

CSUEB Ranks No. 47 in Social Mobility Rankings


In [16]:
print (soup.p) #the first tag <p> is found

<p></p>


In [17]:
soup.findAll('p')

[<p></p>,
 <p><p>In a new set of college and university rankings based on social mobility, Cal State East Bay ranked No. 47 in the nation.</p>
 <p>The <a href="http://www.socialmobilityindex.org">unique new ranking system by CollegeNet</a> attempts to measure how well institutions educate those from families making less than the national median income — and the affordability of that education.</p>
 <p>Unlike prominent rankings systems, CollegeNet’s social mobility survey does not take into account high scores on standardized tests.  </p>
 <p>Tuition and economic background are the most important variables in the rankings, though graduation rate, early career salary and endowment were also taken into account.</p>
 <p>Cal State East Bay was one of several California public schools that did well in the rankings.</p></p>,
 <p>In a new set of college and university rankings based on social mobility, Cal State East Bay ranked No. 47 in the nation.</p>,
 <p>The <a href="http://www.socialmobil

## <font color='red'>**Exercise:**</font>

What is the difference between the following two pieces of code?

    soup.findAll('p')
    
    for tag in soup.findAll('p'):
        print (tag.contents)

In [18]:
for tag in soup.findAll('p'):
    print(tag.contents)

[]
[<p>In a new set of college and university rankings based on social mobility, Cal State East Bay ranked No. 47 in the nation.</p>, '\n', <p>The <a href="http://www.socialmobilityindex.org">unique new ranking system by CollegeNet</a> attempts to measure how well institutions educate those from families making less than the national median income — and the affordability of that education.</p>, '\n', <p>Unlike prominent rankings systems, CollegeNet’s social mobility survey does not take into account high scores on standardized tests.  </p>, '\n', <p>Tuition and economic background are the most important variables in the rankings, though graduation rate, early career salary and endowment were also taken into account.</p>, '\n', <p>Cal State East Bay was one of several California public schools that did well in the rankings.</p>]
['In a new set of college and university rankings based on social mobility, Cal State East Bay ranked No. 47 in the nation.']
['The ', <a href="http://www.socia

Answer: soup.findAll('p') returns a list containing all the p tags, each shown with its full markup. In the for loop, tag.contents returns a list of the children of the current tag: strings, nested tags, and any '\n' (newline) strings that sit between nested tags.
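
The same comparison can be reproduced on a small, made-up document, which makes the individual pieces easier to see than on the full news page. This sketch assumes `bs4` is installed; the HTML string is invented for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p>The <a href='#'>ranking</a> is new.</p><p>Second.</p></div>"
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all("p")  # a list of Tag objects, one per <p>

for tag in tags:
    print(tag)             # the tag with its markup
    print(tag.contents)    # list of direct children: strings and nested tags
    print(tag.get_text())  # all text inside the tag joined into one string

first_children = [str(child) for child in tags[0].contents]
second_children = [str(child) for child in tags[1].contents]
```

The first paragraph has three children (a string, an `<a>` tag, and another string), while get_text() flattens them into a single sentence.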

### Example: Code scraper

Get the content inside each tag of the form:

    <span class="footer-link">...</span>

In [19]:
content_list=soup.findAll('span',attrs={'class':"footer-link"})
content_list
# Note the difference between the cell output above and the output of:
# print(content_list)

[<span class="footer-link">Additional Resources</span>,
 <span class="footer-link">Campus</span>,
 <span class="footer-link">Legal</span>,
 <span class="footer-link">Tools</span>]

In [20]:
soup.findAll('span',attrs={'class':"footer-link"})

[<span class="footer-link">Additional Resources</span>,
 <span class="footer-link">Campus</span>,
 <span class="footer-link">Legal</span>,
 <span class="footer-link">Tools</span>]

In [21]:
for tag in content_list:
    print (tag.contents)
    
# Also try print(tag.get_text()) and print(tag),
# and compare the differences.

['Additional Resources']
['Campus']
['Legal']
['Tools']


### Example: Code scraper

How about using the *re* library to code the scraper?

In [22]:
url = "http://www20.csueastbay.edu/news/2015/10/10232015.html"
htmlfile = urlopen(url)
htmltext = htmlfile.read()
#print (htmltext) #For testing
text = htmltext.decode()
regex = '<span class="footer-link">(.+?)</span>'
pattern = re.compile(regex)
#print(pattern)  # For testing
X = re.findall(pattern, text)
print (X)

['Additional Resources', 'Campus', 'Legal', 'Tools']
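
Both approaches give the same answer on this page, but they do not always agree. The sketch below uses a made-up HTML fragment in which one span carries an extra attribute before `class`; the literal regular expression misses it, while BeautifulSoup matches on the parsed attribute. It assumes `bs4` is installed.

```python
import re
from bs4 import BeautifulSoup

# Invented fragment: the second span has an extra attribute before class.
html = (
    '<span class="footer-link">Campus</span>'
    '<span id="legal" class="footer-link">Legal</span>'
)

# Regular expression: matches only the exact text <span class="footer-link">.
pattern = re.compile(r'<span class="footer-link">(.+?)</span>')
regex_hits = pattern.findall(html)

# BeautifulSoup: matches on the parsed class attribute, whatever else the tag has.
soup = BeautifulSoup(html, "html.parser")
soup_hits = [
    tag.get_text()
    for tag in soup.find_all("span", attrs={"class": "footer-link"})
]

print(regex_hits)  # ['Campus']
print(soup_hits)   # ['Campus', 'Legal']
```

A regular expression is quick for markup whose exact form you know; a parser is usually the more robust choice when attribute order or spacing may vary.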
