# Web Scraping
---------


### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-05-16  
**Last Updated:** 2023-05-19  
**Version:** 1  

### Description
This notebook introduces newcomers to Python to web scraping.
It covers the following popular web-scraping libraries:
- requests (paired with BeautifulSoup for parsing HTML)
- selenium

### Notes
Selenium is not part of the Python standard library and must be installed separately.
See the `selenium_setup.md` file in this project sub-folder for instructions on setting up Python and Selenium.

ChatGPT was used to help create some of the documentation and code in this notebook.
Credit to this tool for summarizing information well and aiding the learning process for myself and others.


## 1. Using Requests and BeautifulSoup


The [Python Requests library](https://requests.readthedocs.io/en/latest/) is a popular HTTP client for Python. It sends HTTP/1.1 requests (GET, POST, PUT, DELETE, and others) and handles the responses, with built-in support for query parameters, form-encoded data, file uploads, and JSON. The library hides most of the complexity of making requests behind a simple API. It also manages cookies, headers, and sessions, and a `Session` object reuses connections through connection pooling, which makes repeated requests to the same host more efficient.
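
As a minimal sketch of the features described above, the block below builds a GET request with query parameters and a custom header, then inspects the encoded URL without sending anything over the network. The endpoint `https://example.org/data`, the parameter names, the `User-Agent` value, and the helper `fetch_html` are all assumptions made up for illustration; they are not part of the original notebook.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
BASE_URL = "https://example.org/data"

# Build the request without sending it, so the encoded URL can be inspected offline.
request = requests.Request(
    "GET",
    BASE_URL,
    params={"table": "4c6", "year": "2020"},
    headers={"User-Agent": "web-scraping-tutorial"},
)
prepared = request.prepare()
print(prepared.method, prepared.url)


def fetch_html(url, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # A Session reuses connections and cookies across the requests made through it.
    with requests.Session() as session:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
        return response.text
```

Calling `fetch_html(url)` with a real address would perform the download. Passing a `timeout` is a design choice worth keeping, because Requests does not time out by default.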

[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files. It sits on top of an HTML or XML parser and turns the page source into a parse tree, which exposes the document's hierarchy as Python objects. BeautifulSoup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying that tree, so you can locate an element and extract the information you need in a few lines. It also converts incoming documents to Unicode and outgoing documents to UTF-8 automatically, which removes most encoding concerns from scraping tasks.
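
To illustrate navigating and searching a parse tree, here is a small sketch that parses an inline HTML string and picks a table by its header text rather than by its position on the page. The HTML is made up for this example (it is not the real SSA markup), and `find_table_with_header` is a hypothetical helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not the real SSA markup).
html = """
<html><body>
  <table id="nav"><tr><td>Home</td><td>Contact</td></tr></table>
  <table id="data">
    <tr><th>Exact age</th><th>Life expectancy</th></tr>
    <tr><td>0</td><td>74.12</td></tr>
    <tr><td>1</td><td>73.55</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_with_header(soup, header_text):
    """Return the first <table> that has a <th> whose text equals header_text."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


table = find_table_with_header(soup, "Exact age")

# Collect the text of each data row (rows that contain <td> cells).
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td") is not None
]
print(table["id"], rows)
```

Selecting by header text is one way to avoid depending on a table's position, which can change if the page layout changes.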

In [1]:
# Import necessary libraries
from io import StringIO

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define URL
url = "https://www.ssa.gov/oact/STATS/table4c6.html"

# Send an HTTP GET request to the URL (the timeout avoids waiting forever on a stalled connection)
response = requests.get(url, timeout=30)

# Print the response object; <Response [200]> means the request succeeded
print(response)

# Stop here with an HTTPError if the server returned a 4xx or 5xx status
response.raise_for_status()

# Parse the HTML response
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table on the webpage. We assume the first table is the one we want (Python indexes from 0)
table = soup.find_all('table')[0]

# Parse the table into a pandas DataFrame with a MultiIndex header: the first 2 rows hold the column-level information.
# Recent pandas versions expect a file-like object rather than a raw HTML string, hence StringIO
df = pd.read_html(StringIO(str(table)), header=[0, 1])[0]

# Drop the last row
df = df.drop(df.tail(1).index)

# Display the DataFrame
df

<Response [200]>


| | Exact age | Male | Male | Male | Female | Female | Female |
|---|---|---|---|---|---|---|---|
| | Exact age | Death probability a | Number of lives b | Life expectancy | Death probability a | Number of lives b | Life expectancy |
| 0 | 0 | 0.005837 | 100000 | 74.12 | 0.004907 | 100000 | 79.78 |
| 1 | 1 | 0.000410 | 99416 | 73.55 | 0.000316 | 99509 | 79.17 |
| 2 | 2 | 0.000254 | 99376 | 72.58 | 0.000196 | 99478 | 78.19 |
| 3 | 3 | 0.000207 | 99350 | 71.60 | 0.000160 | 99458 | 77.21 |
| 4 | 4 | 0.000167 | 99330 | 70.62 | 0.000129 | 99442 | 76.22 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 115 | 0.800319 | 0 | 0.74 | 0.799516 | 0 | 0.74 |
| 116 | 116 | 0.840335 | 0 | 0.68 | 0.840335 | 0 | 0.68 |
| 117 | 117 | 0.882352 | 0 | 0.63 | 0.882352 | 0 | 0.63 |
| 118 | 118 | 0.926469 | 0 | 0.58 | 0.926469 | 0 | 0.58 |
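
The two-level (MultiIndex) header in the output above can be awkward to work with. The sketch below shows one possible way to flatten such a header into single-level column names, using a tiny hand-built DataFrame rather than the downloaded table. The shortened column labels and the `flatten` helper are assumptions for illustration only.

```python
import pandas as pd

# A tiny stand-in for the two-level header that read_html produces.
columns = pd.MultiIndex.from_tuples([
    ("Exact age", "Exact age"),
    ("Male", "Death probability"),
    ("Male", "Life expectancy"),
    ("Female", "Death probability"),
    ("Female", "Life expectancy"),
])
df_small = pd.DataFrame(
    [[0, 0.005837, 74.12, 0.004907, 79.78],
     [1, 0.000410, 73.55, 0.000316, 79.17]],
    columns=columns,
)


def flatten(col):
    """Join the two header levels, collapsing labels that simply repeat."""
    top, bottom = col
    return top if top == bottom else f"{top}: {bottom}"


# Work on a copy so the original MultiIndex frame stays available.
flat = df_small.copy()
flat.columns = [flatten(col) for col in df_small.columns]
print(list(flat.columns))

# Selecting a top-level label on the MultiIndex frame returns its sub-columns.
male = df_small["Male"]
```

Keeping the MultiIndex is convenient for selecting a whole group such as `df_small["Male"]`, while the flattened names are easier to export or to reference one column at a time.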


## 2. Using Selenium

[Selenium](https://selenium-python.readthedocs.io/) is a tool for controlling web browsers programmatically. It works across browsers and operating systems and is widely used for web scraping, web testing, and automating repetitive tasks on the web. Its Python bindings expose the WebDriver API, which drives a real browser with browser-native commands, so a page behaves as it would for a human visitor, including content that is built by JavaScript. Selenium also provides bindings for several other programming languages, which makes it a versatile choice for testing web applications.

In [2]:
from io import StringIO
import pickle
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd

# Read in our file path to the chromedriver
with open('./Data/secret_webdriverpath.pkl', 'rb') as file:
    fpath = pickle.load(file)

# Initialize the Chrome driver. Selenium 4 takes the driver path through a Service object;
# passing the path directly to webdriver.Chrome() is deprecated
driver = webdriver.Chrome(service=Service(fpath))

# Go to www.google.com
driver.get('https://www.google.com')

# Find the search bar
search_bar = driver.find_element(By.NAME, 'q')

# Type 'ssa actuarial tables'
search_bar.send_keys('ssa actuarial tables')

# Pause briefly so the typed query is visible, then hit Enter
time.sleep(3)

search_bar.send_keys(Keys.RETURN)

# Wait for the page to load
time.sleep(3)

# # Wait for the search results to load
# WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div#rso div.g')))

# Click on the first search result link
driver.find_element(By.CSS_SELECTOR, 'div#rso div.g a').click()

# Wait for the page to load
time.sleep(3)

# Grab the underlying HTML of the page
page_html = driver.page_source

# Quit the driver, which closes the browser and ends the WebDriver session
driver.quit()

################################################
# Below should look familiar....
################################################

# Parse the HTML that Selenium retrieved
soup = BeautifulSoup(page_html, 'html.parser')

# Find the table on the webpage. We assume the first table is the one we want (Python indexes from 0)
table = soup.find_all('table')[0]

# Parse the table into a pandas DataFrame with a MultiIndex header: the first 2 rows hold the column-level information.
# Recent pandas versions expect a file-like object rather than a raw HTML string, hence StringIO
df = pd.read_html(StringIO(str(table)), header=[0, 1])[0]

# Drop the last row
df = df.drop(df.tail(1).index)

# Display the DataFrame
df



| | Exact age | Male | Male | Male | Female | Female | Female |
|---|---|---|---|---|---|---|---|
| | Exact age | Death probability a | Number of lives b | Life expectancy | Death probability a | Number of lives b | Life expectancy |
| 0 | 0 | 0.005837 | 100000 | 74.12 | 0.004907 | 100000 | 79.78 |
| 1 | 1 | 0.000410 | 99416 | 73.55 | 0.000316 | 99509 | 79.17 |
| 2 | 2 | 0.000254 | 99376 | 72.58 | 0.000196 | 99478 | 78.19 |
| 3 | 3 | 0.000207 | 99350 | 71.60 | 0.000160 | 99458 | 77.21 |
| 4 | 4 | 0.000167 | 99330 | 70.62 | 0.000129 | 99442 | 76.22 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 115 | 0.800319 | 0 | 0.74 | 0.799516 | 0 | 0.74 |
| 116 | 116 | 0.840335 | 0 | 0.68 | 0.840335 | 0 | 0.68 |
| 117 | 117 | 0.882352 | 0 | 0.63 | 0.882352 | 0 | 0.63 |
| 118 | 118 | 0.926469 | 0 | 0.58 | 0.926469 | 0 | 0.58 |
