<a href="https://colab.research.google.com/github/BrianNguyen0305/Web_Scraping_Workshop_BN_2021/blob/main/Web_Scraping_Workshop_BN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping W/ Python**
SPRING 2021 ANALYTICS WORKSHOP SERIES - TOPIC 2 <br/>
UTDMSBA - BALC <br/>
Brian Nguyen



---

## Agenda:


1.   HTML Basics
2.   Chrome DevTools
3.   Compare Web-Scraping Packages 
4.   BeautifulSoup 
 * Simple Exercise
5.   Selenium 
 * A-tad-harder Exercise
6. Advanced Scraping and Crawling Demo
7. Ethical & Efficiency Discussion


## Goals:

1. Inspect an HTML page & Identify what to scrape.
2. Scrape with requests and BeautifulSoup.
3. Drive web crawling with Selenium.
4. Be a responsible scraper.

## Why Scrape?
1. Build Datasets:
  * Texts
  * Numbers
  * Images
2. For Analysis 📈
  * Sales
  * Marketing
3. For Machine Learning 🤖
4. End-to-end testing.
5. Etc.

---

## I. HTML Basics

### 1. Overview of HTML:
1. Hypertext Markup Language (HTML)
2. Standard markup language for documents.
3. Instructs the web browser how to display content.
   * Provide structure.
   * Cascading Style Sheets (CSS) = Style.
   * JavaScript (or any script) = Interactive.
4. Tags (`< >`) mark where each element starts and ends.
   * Usually paired
   * Start: `<head>`
   * End: `</head>`
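
To make the start/end pairing concrete, here is a minimal sketch that logs every tag the parser meets. It uses `HTMLParser` from Python's standard library rather than any of the workshop packages, and the `TagLogger` class and the tiny page are illustrative only.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start and end tag in the order the parser meets them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed("<html><head><title>My Courses</title></head><body><p>Hi</p></body></html>")
logger.close()

# Each element opens with a start tag and closes with the matching end tag
for kind, tag in logger.events:
    print(kind, tag)
```

Reading the printed list top to bottom shows the nesting: `html` opens first and closes last, with `head` and `body` in between.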

### 2. Common HTML Tags: 
1. `<!DOCTYPE html>` declaration defines this document to be HTML5.
2. `<html>` element is the root element of an HTML page.
3. `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
4. `<head>` element contains meta information about the document.
5. `<title>` element specifies a title for the document.
6. `<body>` element contains the visible page content.
7. `<h1>` element defines a large heading.
8. `<p>` element defines a paragraph.
9. `<a>` element defines a hyperlink (look for the `href` attribute).
10. And Many More!
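
Since the link target lives in the `href` attribute of `<a>`, a scraper usually reads attributes rather than tag text. The sketch below collects every `href` with the standard-library `HTMLParser`. The `LinkCollector` class is a hypothetical helper, and the snippet reuses two links from the practice page further down.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


page = """
<p>Docs: <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a></p>
<p>Placeholder: <a href="#">Selenium</a></p>
"""

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

A `#` value is a placeholder link that points nowhere.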

### 3. Make a Simple HTML Page 

Notes:
1. Note Repetitions
2. Note Styles
3. Note Buttons

Tasks:
1. Change Size for Heading 2, 3
2. Fix Link 2, 3
3. Fix Typo in Title ("ZZZZZZ")



In [1]:
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML

In [None]:
display(HTML("""<!doctype html>
<html lang="en">
   <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
      <title>My Courses</title>
   </head>
   <body>
      <h1>Let's Get Scrapin' zzzzzzzzzzzzz !</h1>
      <div class="card" id="card-python-for-beginners">
         <div class="card-header">
            BeautifulSoup
         </div>
         <div class="card-body">
            <h3 class="card-title">Web-Scraping for beginners</h3>
            <p class="card-text">If you are new to web-scraping, you should learn this!</p>
              <p>Ordered list:</p>
              <ol>
                 <li>Data collection</li>
                 <li>Exploratory data analysis</li>
                 <li>Data analysis</li>
                 <li>Policy recommendations</li>
               </ol>
  <hr>
            <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" class="btn btn-primary" >Start for 20$</a>
         </div>
      </div>
      <div class="card" id="card-python-web-development">
         <div class="card-header">
            Selenium
         </div>
         <div class="card-body">
            <h4 class="card-title">Web Automation with Selenium</h4>
            <p class="card-text">If you feel confident enough with XPath, you are ready to learn the most versatile web-scraping tool!</p>
              <p style="color:red">
               Add colour to your paragraphs.
               </p>
            <a href="#" class="btn btn-primary">Start for 50$</a>
         </div>
      </div>
        <div class="card" id="card-python-machine-learning">
         <div class="card-header">
            Scrapy
         </div>
         <div class="card-body">
            <h5 class="card-title">Master Web-Scraping with Scrapy</h5>
            <p class="card-text">Become a Web-Scraping master!</p>
              <p>
              That's a text paragraph. You can also <b>bold</b>, <mark>mark</mark>, <ins>underline</ins>, <del>strikethrough</del> and <i>emphasize</i> words.
              You can also add links - here's one to <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>.
              </p>
            <a href="#" class="btn btn-primary">Start for 100$</a>
         </div>
      </div>
   </body>
</html>

"""))

### 4. Chrome DevTools

Overview:
1. Built into Chrome.
2. Super useful tool:
   * View Source
   * Inspect Elements
   * Edit Webpage
3. Equivalent tools are available in other browsers.

Quick Exercise: <br/>
What is your favourite Website? <br/>
a. IMDB <br/>
b. Associated Press <br/>
c. Reddit <br/>
d. LinkedIn <br/>

Tasks:
1. Find Logo
2. Find Text
3. Find a Button 

Shortcuts:
Command + Option + C (Mac) or Control + Shift + C (Windows) or F12

## II. Web-Scraping W/ BeautifulSoup 🍲

### Overview:

1. `requests` fetches the page and collects its source (all of the HTML).
2. BeautifulSoup:
   * Is a Python library.
   * Extracts data from HTML and XML files.
   * Navigates and scrapes a webpage’s tree structure.
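
As a rough illustration of the tree navigation mentioned above, the sketch below parses a small inline string instead of a live page, so it runs without network access. The HTML is a trimmed copy of one card from the practice page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>My Courses</title></head>
  <body>
    <div class="card">
      <h5 class="card-title">Master Web-Scraping with Scrapy</h5>
      <p class="card-text">Become a Web-Scraping master!</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title                   # first <title> in the tree
card = soup.find("div", class_="card")   # first <div> with class "card"
heading = card.h5                        # first <h5> inside that card

print(title_tag.string)                  # text inside <title>
print(heading.parent.name)               # tag name of the heading's parent
print(heading["class"])                  # class attribute, returned as a list
print(heading.find_next_sibling("p").get_text())  # text of the next <p> sibling
```

Moving up (`.parent`), down (`card.h5`), and sideways (`find_next_sibling`) is what "navigating the tree" means in practice.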


### Simple Exercise:


In [31]:
# Imports
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# AP's homepage
ap_url = 'https://apnews.com/'

# Use requests to retrieve data from a given URL (the timeout keeps the call from hanging)
ap_response = requests.get(ap_url, timeout=10)

# Parse the whole HTML page using BeautifulSoup
# The second argument names the parser. Here let's try "html.parser" first
ap_soup = BeautifulSoup(ap_response.text, 'html.parser')

# Take a look at the mess that is ap_soup right now
print(ap_soup.prettify()) 

In [117]:
# Title of the parsed page
ap_soup.title

<title data-rh="true">Associated Press News</title>

In [118]:
# We can also get it without the HTML tags
ap_soup.title.string

'Associated Press News'

### "LXML" Parser Method + For Loop
`find()` vs. `find_all()`

We will use the `.find_all()` method to search the HTML tree for particular tags and get a `list` with all the relevant objects. By contrast, `.find()` returns only the first match.
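
The difference is easiest to see on a small inline document. The headlines below are made-up placeholders, and the sketch uses `html.parser` so it does not depend on `lxml` being installed.

```python
from bs4 import BeautifulSoup

html = """
<h1>First headline</h1>
<h1>Second headline</h1>
<h1>Third headline</h1>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("h1")            # a single Tag, the first match only
every = soup.find_all("h1")        # list-like collection of all matching Tags
missing = soup.find("h2")          # None when nothing matches
none_found = soup.find_all("h2")   # empty collection when nothing matches

# Pull the clean text out of every match
titles = [tag.get_text(strip=True) for tag in every]

print(first.get_text())
print(titles)
print(missing, len(none_found))
```

Because `find()` can return `None`, it is worth checking the result before calling `.text` on it when scraping a page whose layout may change.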

In [141]:
# Parse the page source again, this time with the 'lxml' parser, which is usually faster
ap_soup = BeautifulSoup(ap_response.text, 'lxml')


In [142]:
# Find top story (first one only)
top = ap_soup.find('h1')

# Show what we got
top.text

'AP source: Suspect in Capitol attack suffered delusions'

In [None]:
# Find all top stories 
top_all = ap_soup.find_all('h1')

# Show what we got
top_all

In [None]:
for story in top_all:
  title = story.text
  print(title)

## III. Web-Scraping W/ Selenium:

### Overview:
1. The most versatile of the web-scraping tools.
2. In the right hands, it becomes a powerful web automator (driver).
3. The only one that handles JavaScript-rendered pages easily.
4. Can be very efficient when combined w/ Scrapy.
5. IMO, best combo right now: Selenium + XPath


### "A-tad-harder" Exercise

Colab is not the optimal environment for Selenium and the Chromium driver, <br/>
so let's do a DEMO.

In [None]:
# Install Selenium
%pip install selenium

# Parsel, part of the Scrapy project, lets you extract data from XML/HTML documents using XPath or CSS selectors
%pip install parsel


In [None]:
# Install Driver
!apt update
!apt install chromium-chromedriver

In [154]:
# Initiate and set options
from parsel import Selector
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
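# Assumes chromedriver is on the PATH; recent Selenium releases no longer accept the driver path as a positional argument
driver = webdriver.Chrome(options=options)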

In [179]:
# URL to use in Selenium
driver.get('https://www.boxofficemojo.com/year/?ref_=bo_nb_di_secondarytab')

# Parse the rendered page source into a parsel Selector named sel
sel = Selector(text=driver.page_source)

# XPath to extract the text of every element whose class starts with "a-link-normal" (the year and movie title links)
name = sel.xpath('//*[starts-with(@class, "a-link-normal")]/text()').getall()
print(name)

['Total Gross', '%± LY', 'Releases', 'Average', 'Total Gross', '%± LY', 'Releases', 'Average', '2021', 'Tom and Jerry', '2020', 'Bad Boys for Life', '2019', 'Avengers: Endgame', '2018', 'Black Panther', '2017', 'Star Wars: Episode VIII - The Last Jedi', '2016', 'Finding Dory', '2015', 'Jurassic World', '2014', 'Guardians of the Galaxy', '2013', 'Iron Man 3', '2012', 'The Avengers', '2011', 'Harry Potter and the Deathly Hallows: Part 2', '2010', 'Avatar', '2009', 'Transformers: Revenge of the Fallen', '2008', 'The Dark Knight', '2007', 'Spider-Man 3', '2006', "Pirates of the Caribbean: Dead Man's Chest", '2005', 'Star Wars: Episode III - Revenge of the Sith', '2004', 'Shrek 2', '2003', 'Finding Nemo', '2002', 'Spider-Man', '2001', "Harry Potter and the Sorcerer's Stone", '2000', 'How the Grinch Stole Christmas', '1999', 'Star Wars: Episode I - The Phantom Menace', '1998', 'Titanic', '1997', 'Men in Black', '1996', 'Independence Day', '1995', 'Batman Forever', '1994', 'The Lion King', '1

In [180]:
# XPath to extract the text of every money cell (class starts with "a-text-right mojo-field-type-money")
gross = sel.xpath('//*[starts-with(@class, "a-text-right mojo-field-type-money")]/text()').getall()
print(gross)

['$244,468,683', '$2,222,442', '$2,085,651,481', '$4,593,945', '$11,320,874,529', '$12,426,865', '$11,889,341,443', '$11,973,153', '$11,072,815,067', '$12,996,261', '$11,377,066,920', '$13,290,966', '$11,125,835,068', '$13,151,105', '$10,359,575,749', '$12,202,091', '$10,922,051,943', '$13,222,823', '$10,822,806,722', '$13,411,160', '$10,173,621,826', '$13,936,468', '$10,566,830,616', '$16,231,690', '$10,590,200,693', '$16,393,499', '$9,629,131,592', '$13,281,560', '$9,657,106,911', '$12,460,783', '$9,208,611,128', '$12,343,982', '$8,837,713,363', '$13,073,540', '$9,365,047,036', '$13,378,638', '$9,210,978,005', '$13,809,562', '$9,165,532,414', '$16,079,881', '$8,110,859,106', '$19,638,884', '$7,511,547,085', '$17,110,585', '$7,377,967,100', '$16,468,676', '$6,725,527,166', '$20,136,308', '$6,156,263,535', '$19,858,914', '$5,647,751,531', '$18,456,704', '$5,199,428,915', '$17,867,453', '$5,101,025,737', '$19,695,080', '$4,860,902,708', '$18,205,628', '$4,556,151,332', '$18,445,956', '$
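
Both printed lists are flat, so a natural next step is to pair them into one record per year. The sketch below uses only the first few values copied from the outputs above. It assumes that the first eight entries of `name` are column headers, that the rest alternate between year and title, and that `gross` alternates between Total Gross and Average. Those assumptions come from reading the printed lists, not from the page itself, and `money_to_int` is a hypothetical helper.

```python
# First few values copied from the printed lists above
name = [
    "Total Gross", "%± LY", "Releases", "Average",
    "Total Gross", "%± LY", "Releases", "Average",
    "2021", "Tom and Jerry",
    "2020", "Bad Boys for Life",
    "2019", "Avengers: Endgame",
]
gross = [
    "$244,468,683", "$2,222,442",
    "$2,085,651,481", "$4,593,945",
    "$11,320,874,529", "$12,426,865",
]

HEADER_COUNT = 8  # assumption: the first eight entries are column headers


def money_to_int(text):
    """Turn a string such as '$2,222,442' into the integer 2222442."""
    return int(text.replace("$", "").replace(",", ""))


body = name[HEADER_COUNT:]
years = body[0::2]        # even positions: the year
top_titles = body[1::2]   # odd positions: the title listed next to that year
totals = gross[0::2]      # assumption: even positions hold Total Gross
averages = gross[1::2]    # assumption: odd positions hold Average

records = [
    {
        "year": int(year),
        "title": title,
        "total_gross": money_to_int(total),
        "average": money_to_int(avg),
    }
    for year, title, total, avg in zip(years, top_titles, totals, averages)
]

for row in records:
    print(row)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` for analysis.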