![StatModels](https://www.durhamtech.edu/themes/custom/durhamtech/images/durham-tech-logo-web.svg) 

## Web Scraping

This lecture provides foundational knowledge and examples of scraping data from live websites.

---

# Table of Contents

#### What is web scraping and why do you need it?
#### Needed Packages
#### Scraping ESPN
#### A note on anonymous web scraping and browsing
#### Creating an email address through Python only
#### Download Consumer Spending Data
#### Scrape current US National Debt

## What is web scraping and why do you need it?
Web scraping is a catch-all term for gathering information directly from web pages on the internet.  Unlike structured APIs and databases, web scraping requires you to get creative in how you collect the data, as no two projects are the same.  As an important addition to web scraping, this lecture also covers basic use of the selenium package, which enables you to traverse and interact with the vast majority of modern webpages, ranging from obscure pages all the way to Facebook.

In general, web scraping should be your last resort when automating a data gathering exercise: websites are prone to change over time, many web hosts do not appreciate web scraping, and it is generally much slower than working with prebuilt data sources.

## Needed Packages

In [47]:
import pandas as pd
import warnings
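##NOTE: the cells below use the Selenium 3 API (executable_path, find_element_by_*),
##which later Selenium 4 releases removed
import selenium
from bs4 import BeautifulSoup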
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time
import numpy as np
from random import randint

warnings.filterwarnings('ignore')

## Scraping ESPN

In [12]:
##define website url
url = 'https://www.espn.com/nba/boxscore/_/gameId/401360104'

In [13]:
##Open browser
browser = webdriver.Chrome(executable_path = 'chromedriver.exe')
browser.get(url)

In [14]:
##Close browser
browser.close()

In [16]:
##Open browser without actually seeing it, also known as 'headless' browsing
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)


In [17]:
##Grab the raw webpage
innerHTML = browser.execute_script("return document.body.innerHTML")

In [18]:
##Parse raw webpage and quit the browser (Don't want to gunk up your RAM!!)
soup = BeautifulSoup(innerHTML,"html.parser")
browser.quit()  ##quit() also ends the chromedriver process; close() only closes the window

In [19]:
##Find all tables on page
tables = soup.findAll("table")


In [36]:
##Grab the first three stats for all players on both teams & convert to dataframe
rows = []
for table in tables:
    for row in table.find_all("tr"):
        cells = [ele.text.strip() for ele in row.find_all("td")]
        ##Player rows have 15 cells; skip blank rows and the TEAM totals row
        if len(cells) == 15 and cells[0] != '' and cells[0] != 'TEAM':
            rows.append(cells[0:3])
df = pd.DataFrame(rows, columns = ['Player','Minutes Played','FG'])
df.head()

| | Player | Minutes Played | FG |
|---|---|---|---|
| 0 | J. GrantJ. GrantSF | 33 | 5-16 |
| 1 | S. BeyS. BeySF | 23 | 4-12 |
| 2 | I. StewartI. StewartC | 27 | 2-6 |
| 3 | C. JosephC. JosephPG | 25 | 3-6 |
| 4 | C. CunninghamC. CunninghamPG | 28 | 3-13 |
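
The Player column in the output above repeats each name and then appends the position (for example `J. GrantJ. GrantSF`), and FG packs made and attempted shots into one string. Below is a minimal sketch of one way to tidy both, assuming every player cell follows the pattern name, same name again, uppercase position. `split_player` and `split_made_attempted` are hypothetical helper names, not part of pandas or BeautifulSoup.

```python
import re

# Assumed cell layout: <name><same name><position>, e.g. "J. GrantJ. GrantSF".
# The named backreference requires the second copy to match the first exactly.
PLAYER_PATTERN = re.compile(r"^(?P<name>.+?)(?P=name)(?P<pos>[A-Z]{1,2})$")


def split_player(cell):
    """Return (name, position); fall back to (cell, None) if the pattern does not match."""
    match = PLAYER_PATTERN.match(cell)
    if match is None:
        return cell, None
    return match.group("name"), match.group("pos")


def split_made_attempted(cell):
    """Turn a string such as '5-16' into the integer pair (5, 16)."""
    made, attempted = cell.split("-")
    return int(made), int(attempted)


print(split_player("J. GrantJ. GrantSF"))   # ('J. Grant', 'SF')
print(split_made_attempted("5-16"))         # (5, 16)
```

Applied to the dataframe, these helpers could be passed to `df['Player'].apply(...)` and `df['FG'].apply(...)`. If ESPN changes the cell layout, the fallback simply returns the original text.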


## A note on anonymous web scraping and browsing
When web scraping, many websites will blacklist your IP address in an effort to prevent you from abusing their sites.  If you are at a job and your employer needs the data, an IP ban is not something you want ruining your day.  A common workaround is to use a proxy server so that the website doesn't know your actual IP address.  The code below gives a framework for putting a proxy server between you and the internet calls you are making with Python.

In [78]:
##Proxy server list
## 'https://proxydig.com/free-proxy-list/'

In [5]:
##Connect to a website using a proxy server
PROXY = '209.165.163.187:3128'  ##Note that if you cannot connect to a webpage, try using a different proxy server from the site above
url = 'https://whatismyipaddress.com/'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

##NOTE: Using proxy servers will avoid most common website blacklisting, BUT it is not a substitute for a VPN for security,
##NOR should you attempt to use proxy servers for nefarious activity
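
The comment in the cell above suggests trying a different proxy when one fails. Below is a minimal sketch of that retry idea. It assumes a hypothetical helper `first_working_proxy` and a caller-supplied `try_connect` function, neither of which is part of Selenium. The addresses are placeholders from a documentation-only IP range, not real proxies.

```python
def first_working_proxy(proxies, try_connect):
    """Return the first proxy for which try_connect(proxy) succeeds, else None.

    try_connect is any callable that raises an exception when the
    connection through the given proxy fails.
    """
    for proxy in proxies:
        try:
            try_connect(proxy)
        except Exception as error:
            # Free proxies fail often, so report the failure and move on.
            print(f"{proxy} failed: {error}")
        else:
            return proxy
    return None


def fake_connect(proxy):
    """Stand-in for a real connection attempt, used only for this demo."""
    if proxy.startswith("192.0.2.1:"):
        raise ConnectionError("timed out")


candidates = ["192.0.2.1:3128", "192.0.2.2:8080"]
print(first_working_proxy(candidates, fake_connect))
```

In the notebook, `try_connect` could wrap the cell above: build the `ChromeOptions` with `--proxy-server`, open the browser, and call `browser.get(url)`. Whether a dead proxy raises an exception or only shows an error page may depend on the browser, so that wrapper might need its own check.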

While proxy servers enable you to avoid blacklisting, they leave you exposed from a security standpoint.  There are two common methods to add a layer of security.  The first is to always turn on a VPN before doing any web scraping; this ensures that even if someone tries to trace back to the original IP, only the VPN's address will appear.  Alternatively, Selenium allows you to use the Tor browser to make internet calls.  This method is generally extremely slow and unnecessary, but it is a fun exercise if you have a couple of hours to spare!

## -------------PRACTICE-------------

1. Create a function that will open and return a headless web browser. Think about which parameters would be helpful to pass to the function.

2. Go to: https://www.macrotrends.net/1333/historical-gold-prices-100-year-chart and scrape the table titled 'Gold Prices - Historical Annual Data'


## Creating an email address through Python only


In [28]:
##Open webpage
url = 'https://mail.tutanota.com/login'
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

In [29]:
##Select 'More' so we can create an account
more_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[3]/div/button')
more_button.click()

In [31]:
##Select 'Sign Up' so we can create an account
sign_up_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[4]/div/div/div/button[1]/div')
sign_up_button.click()

In [32]:
##Select the Free option
free_button = browser.find_element_by_xpath('//*[@id="upgrade-account-dialog"]/div[2]/div[1]/div[1]/div[5]/button/div')
free_button.click()

In [33]:
##Agree to terms
term_1 = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[2]/div[1]/div/input')
term_1.click()
term_2 = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[2]/div[2]/div/input')
term_2.click()
ok_button = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[3]/button[2]/div')
ok_button.click()


In [37]:
##Add Account Info
email_add = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[1]/div/div/div/div[1]/input')
email_add.click()
email_add.send_keys('test_user_abc_123') ###You will need to put in a new username!

password = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[2]/div[1]/div/div/div/div[1]/input[4]')
password.click()
password.send_keys('Sample_password!')

sec_password = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[2]/div[3]/div/div/div/div/input')
sec_password.click()
sec_password.send_keys('Sample_password!')


In [38]:
##Agree to terms 2
term_1 = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[3]/div/input')
term_1.click()
term_2 = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[4]/div/input')
term_2.click()
ok_button = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[5]/button/div')
ok_button.click()

In [39]:
##Continue
ok_button = browser.find_element_by_xpath('//*[@id="wizardDialogContent"]/div[4]/div/button/div')
ok_button.click()

In [40]:
##Type in Password
password_input = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[1]/form/div[2]/div/div/div/div/div/input')
password_input.click()
password_input.send_keys('Sample_password!')


In [41]:
##Login
login_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[1]/form/div[4]/button/div')
login_button.click()
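
Each cell above looks an element up immediately. That works when the cells are run by hand, but it can fail if they run back to back before the page has finished rendering. Below is a minimal sketch of a polling retry, assuming a hypothetical helper `wait_for`. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() repeatedly until it stops raising, then return its result.

    Re-raises the last exception once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)


# Demo with a stand-in lookup that fails twice before succeeding.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not on the page yet")
    return "button"


print(wait_for(flaky_lookup, timeout=5.0, poll=0.01))   # button
```

With a real browser, the lookup would be wrapped in a lambda, for example `wait_for(lambda: browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[3]/div/button'))`, and the returned element clicked as before.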

## -------------PRACTICE-------------

## Download Consumer Spending Data

In [48]:
##Open Webpage and make full screen
url = 'https://www.bea.gov/'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

In [49]:
##Navigate to Data section
data = browser.find_element_by_link_text('Data')
data.click()

In [51]:
##Navigate to Data by Topic section
data = browser.find_element_by_partial_link_text('Data by Topic')
data.click()

In [52]:
##Navigate to Consumer Spending Section
data = browser.find_element_by_link_text('Consumer Spending')
data.click()

In [53]:
##Navigate to next Consumer Spending Section
data = browser.find_element_by_xpath('//*[@id="test"]/div[2]/article/div/div/div/ul/li[1]/a')
data.click()

In [54]:
##Navigate to Interactive Data
data = browser.find_element_by_link_text('Interactive Data')
data.click()

In [55]:
##Navigate to Summary Tables
data = browser.find_element_by_link_text('Summary Tables')
data.click()

In [56]:
##Navigate to Personal Income and Outlays
data = browser.find_element_by_partial_link_text('PERSONAL INCOME AND OUTLAYS')
data.click()

In [57]:
##Navigate to table 2.2A
data = browser.find_element_by_partial_link_text('2.2A')
data.click()

In [59]:
##Download Data
data = browser.find_element_by_xpath('//*[@id="showDownload"]')
data.click()

In [60]:
##Select Download Format
data = browser.find_element_by_xpath('//*[@id="download_wraper"]/div/a[2]')
data.click()
browser.close()
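
Most of the navigation above repeats the same find-and-click step with a different link text. Below is a sketch of those steps as a loop. It assumes a hypothetical helper `click_links` and the Selenium 3 style `find_element_by_partial_link_text` method used in this notebook. The `FakeBrowser` and `FakeLink` classes exist only so the example runs without opening Chrome.

```python
def click_links(browser, link_texts):
    """Click each link in order, locating it by (partial) visible text."""
    for text in link_texts:
        link = browser.find_element_by_partial_link_text(text)
        link.click()


class FakeLink:
    """Stand-in for a Selenium element; records the click instead of navigating."""

    def __init__(self, text, log):
        self.text = text
        self.log = log

    def click(self):
        self.log.append(self.text)


class FakeBrowser:
    """Stand-in for webdriver.Chrome so the sketch runs offline."""

    def __init__(self):
        self.clicked = []

    def find_element_by_partial_link_text(self, text):
        return FakeLink(text, self.clicked)


steps = ["Data", "Data by Topic", "Consumer Spending"]
demo = FakeBrowser()
click_links(demo, steps)
print(demo.clicked)   # ['Data', 'Data by Topic', 'Consumer Spending']
```

Partial matching is looser than `find_element_by_link_text`, so a short text such as 'Data' may match a different link first. The steps that rely on XPath would still need their own calls.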

## -------------PRACTICE-------------

## Scrape current US National Debt

In [61]:
##Open Webpage and make full screen
url = 'https://www.usdebtclock.org/'
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

In [62]:
##Grab the raw webpage
innerHTML = browser.execute_script("return document.body.innerHTML")

In [63]:
##Parse raw webpage and quit the browser (Don't want to gunk up your RAM!!)
soup = BeautifulSoup(innerHTML,"html.parser")
browser.quit()  ##quit() also ends the chromedriver process; close() only closes the window

In [69]:
##Scrape the first number showing the total debt (the first <span> nested inside a <div>)
first_span = soup.select_one("div span")
if first_span is not None:
    print(first_span.text.strip())

$28,996,631,461,201
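
The scraped value is a string, so it cannot be used in arithmetic yet. Below is a minimal sketch of converting it to an integer, assuming the text is always a dollar sign followed by comma-grouped digits. `parse_dollars` is a hypothetical helper name.

```python
def parse_dollars(text):
    """Convert a string such as '$28,996,631,461,201' to an int."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


debt = parse_dollars("$28,996,631,461,201")
print(debt)                             # 28996631461201
print(f"{debt / 1e12:.2f} trillion")    # 29.00 trillion
```

`int()` raises a `ValueError` if the page ever returns something other than digits, which is a useful early warning that the site layout has changed.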
