* [Web Scraping For Beginners with Python](https://medium.com/@durgaswaroop/web-scraping-with-python-introduction-7b3c0bbb6053)
* [Web Scraping in Python](https://medium.com/dreidev/web-scraping-in-python-e07fba0a1663)
* [Web Scraping](https://medium.com/tag/web-scraping)
* [Web Scraping Tutorial with Python: Tips and Tricks](https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071)
* [SQLite Python tutorial](http://zetcode.com/db/sqlitepythontutorial/)
* [Better web scraping in Python with Selenium, Beautiful Soup, and pandas](https://medium.freecodecamp.org/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251)
* [Locating Elements within Selenium](https://selenium-python.readthedocs.io/locating-elements.html#locating-by-id)
* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
* [1MDB](https://en.wikipedia.org/wiki/1Malaysia_Development_Berhad)
* [HOW TO RUN WEB DRIVERS WITH PROXIES IN PYTHON](https://johnpatrickroach.com/2017/03/31/how-to-run-web-drivers-with-proxies-in-python/)
* [Automatic news scraping with Python, Newspaper and Feedparser](https://holwech.github.io/blog/Automatic-news-scraper/)
* [Web Scraping in Python using Scrapy (with multiple examples)](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/)
* [Scraping a JS-Rendered Page](http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_a_Webpage_Rendered_by_Javascript_Using_Python.php)
* [How to Scrape Javascript Rendered Websites with Python & Selenium](https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa)
* [5 Simple Tips for Efficient Web Crawling using Selenium Python](https://medium.com/dreamcatcher-its-blog/5-simple-tips-for-improving-automated-web-testing-or-efficient-web-crawling-using-selenium-python-43038d7b7916)
* [Scraping the full content from a lazy-loading webpage](https://codereview.stackexchange.com/questions/167327/scraping-the-full-content-from-a-lazy-loading-webpage)
* [Headless Chrome in AWS](https://robertorocha.info/setting-up-a-selenium-web-scraper-on-aws-lambda-with-python/)
* [How To Make Your Selenium Scripts Faster](https://www.linkedin.com/pulse/how-make-your-selenium-tests-faster-alex-siminiuc/)

# Selenium

While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. Because it drives a full browser that loads and renders everything on the page, it needs far more CPU, memory, and time than a plain HTTP request. Even a headless browser such as PhantomJS cannot compete with a simple request. I recommend using Selenium only when you really need to interact with the page, for example to click buttons.
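As a rough sketch of the "simple request" alternative, the snippet below downloads a page with the standard library and extracts its links without starting a browser. The names `fetch_html` and `LinkCollector` and the sample HTML are illustrative assumptions, not part of the original notebook. This approach only works when the content is already present in the raw HTML rather than rendered by JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page with a single HTTP request, no browser involved."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample markup; in practice this would be fetch_html(some_url)
sample_html = (
    '<ul><li><a href="/a">Agency A</a></li>'
    '<li><a href="/b">Agency B</a></li></ul>'
)

collector = LinkCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```

If the links you need show up this way, Selenium is probably unnecessary for that page.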

In [1]:
#!pip3 install tabulate

In [19]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
from tabulate import tabulate
import os

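#URL of the page to scrape
url = 'http://kanview.ks.gov/PayRates/PayRates_Agency.aspx'

#path to the chromedriver binary, expected in the current working directory
chrome_path = os.path.join(os.getcwd(), 'chromedriver')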

In [20]:
# create a new Chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(chrome_path,
                          chrome_options=chrome_options)
driver.implicitly_wait(30)
driver.get(url)

In [21]:
# get the inner html as string
#innerHTML = driver.execute_script("return document.body.innerHTML") #returns the inner HTML as a string

In [22]:
#After opening the url above, Selenium clicks the specific agency link
python_button = driver.find_element_by_id('MainContent_uxLevel1_Agencies_uxAgencyBtn_33') #FHSU

python_button.click() #click fhsu link

#Selenium hands the page source to Beautiful Soup
soup_level1=BeautifulSoup(driver.page_source, 'lxml')

In [23]:
datalist = [] #empty list
x = 0 #counter

#Beautiful Soup finds all Job Title links on the agency page and the loop begins
for link in soup_level1.find_all('a', 
                                 id=re.compile("^MainContent_uxLevel2_JobTitles_uxJobTitleBtn_")):
    
    #Selenium visits each Job Title page
    python_button = driver.find_element_by_id('MainContent_uxLevel2_JobTitles_uxJobTitleBtn_' + str(x))
    python_button.click() #click link
    
    #Selenium hands off the source of the specific job page to Beautiful Soup
    soup_level2=BeautifulSoup(driver.page_source, 'lxml')

    #Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all('table')[0]
    
    #Giving the HTML table to pandas to put in a dataframe object
    df = pd.read_html(str(table),header=0)
    
    #Store the dataframe in a list
    datalist.append(df[0])
    
    #Ask Selenium to click the back button
    driver.execute_script("window.history.go(-1)") 
    
    #increment the counter variable before starting the loop over
    x += 1
    
    #stop after the first four job titles to keep the demo short
    if x > 3:
        break
    
    #end loop block
    
#loop has completed

#end the Selenium browser session
driver.quit()

#combine all pandas dataframes in the list into one big dataframe
result = pd.concat(datalist, ignore_index=True)

#convert the pandas dataframe to JSON
json_records = result.to_json(orient='records')

#pretty print to CLI with tabulate
#converts to an ascii table
print(tabulate(result, 
               headers=["Employee Name","Job Title","Overtime Pay","Total Gross Pay"],
               tablefmt='psql'))

+----+---------------------+--------------------------------+----------------+-------------------+
|    | Employee Name       | Job Title                      | Overtime Pay   | Total Gross Pay   |
|----+---------------------+--------------------------------+----------------+-------------------|
|  0 | Brown,Michelle N    | Acad Adv Career Explr Asst Dir | $0.00          | $44,011.11        |
|  1 | Griffin,Patricia L  | Academic Adv Career Explor Dir | $0.00          | $78,350.21        |
|  2 | Armstrong,Micki A   | Academic Advisor               | $0.00          | $45,587.69        |
|  3 | Fisher,Erica A      | Academic Advisor               | $0.00          | $38,099.14        |
|  4 | Fitzhugh,Nanette J  | Academic Advisor               | $0.00          | $47,923.12        |
|  5 | Hepner,Kristine R   | Academic Advisor               | $0.00          | $33,616.20        |
|  6 | Johnson,Stephanie A | Academic Advisor               | $0.00          | $35,144.78        |
|  7 | Lei

In [24]:
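#write the JSON records to a file; the with-block closes it automatically
path = os.getcwd() #assumption: `path` was never defined above, so save in the working directory
with open(path + "/fhsu_payroll_data.json", "w") as f: #FHSU
    f.write(json_records)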
# sublime: control + cmd + j

# Modules & Import files 

In [9]:
import sys
sys.path

['',
 '/anaconda3/lib/python36.zip',
 '/anaconda3/lib/python3.6',
 '/anaconda3/lib/python3.6/lib-dynload',
 '/anaconda3/lib/python3.6/site-packages',
 '/anaconda3/lib/python3.6/site-packages/aeosa',
 '/anaconda3/lib/python3.6/site-packages/spynner-2.19-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/autopy-1.0.1-py3.6-macosx-10.7-x86_64.egg',
 '/anaconda3/lib/python3.6/site-packages/pyquery-1.4.0-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/unittest2-1.1.0-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/traceback2-1.4.0-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/argparse-1.4.0-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/linecache2-1.0.0-py3.6.egg',
 '/anaconda3/lib/python3.6/site-packages/IPython/extensions',
 '/Users/song/.ipython']

#### Python searches the directories in `sys.path` (listed above) for the modules we import
* 1. A module outside those directories cannot be found: with `mymodule` in the folder 'modules', the import raises ModuleNotFoundError
* 2. Next, let's try to manipulate the search path

In [3]:
import mymodule

ModuleNotFoundError: No module named 'mymodule'

NameError: name 'mymodule' is not defined
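One way to manipulate the search path is to add the module's folder to `sys.path` before importing. The sketch below is a minimal illustration under stated assumptions: it builds a throwaway 'modules' folder in a temporary directory, and the contents of `mymodule.py` (a single `greet` function) are invented for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: a folder named 'modules' that holds mymodule.py.
# A temporary directory stands in for the real project folder.
project_dir = Path(tempfile.mkdtemp())
modules_dir = project_dir / "modules"
modules_dir.mkdir()
(modules_dir / "mymodule.py").write_text(
    "def greet(name):\n    return 'Hello, ' + name\n"
)

# Put the folder on the search path so the import system can find the module
if str(modules_dir) not in sys.path:
    sys.path.insert(0, str(modules_dir))

# Make sure the import system notices the file that was just created
importlib.invalidate_caches()

import mymodule

greeting = mymodule.greet("song")
print(greeting)
```

The change to `sys.path` lasts only for the current interpreter session. For a permanent setup, the `PYTHONPATH` environment variable or installing the code as a package are the usual alternatives.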

In [8]:
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_path = r'/Users/song/GoogleDrive_SMU/MAS/Alternative Data/WebScrapping/chromedriver'

class Fortune500Scraper:
    def __init__(self, chrome_path):
        self.driver = webdriver.Chrome(chrome_path)
        self.wait = WebDriverWait(self.driver, 10)

    def get_last_line_number(self):
        """Get the line number of last company loaded into the list of companies."""
        return int(self.driver.find_element_by_css_selector("ul.company-list > li:last-child > a > span:first-child").text)

    def get_links(self, max_company_count=10):
        """Extracts and returns company links, scrolling until at least max_company_count companies are loaded."""
        self.driver.get('http://fortune.com/fortune500/list/')
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.company-list")))

        last_line_number = 0
        while last_line_number < max_company_count:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            self.wait.until(lambda driver: self.get_last_line_number() != last_line_number)
            last_line_number = self.get_last_line_number()

        return [company_link.get_attribute("href")
                for company_link in self.driver.find_elements_by_css_selector("ul.company-list > li > a")]

    def get_company_data(self, company_link):
        """Extracts and returns company-specific information as a dict."""
        self.driver.get(company_link)

        return {
            row.find_element_by_css_selector(".company-info-card-label").text: row.find_element_by_css_selector(".company-info-card-data").text
            for row in self.driver.find_elements_by_css_selector('.company-info-card-table > .columns > .row')
        }

if __name__ == '__main__':
    scraper = Fortune500Scraper(chrome_path)

    company_links = scraper.get_links(max_company_count=100)
    for company_link in company_links:
        company_data = scraper.get_company_data(company_link)
        pprint(company_data)
        print("------")

{'CEO': 'C. Douglas McMillon',
 'CEO Title': 'President, Chief Executive Officer & Director',
 'Employees': '2,300,000',
 'HQ Location': 'Bentonville, Ark.',
 'Industry': 'General Merchandisers',
 'Sector': 'Retailing',
 'Website': 'www.stock.walmart.com',
 'Years on Fortune 500 List': '24'}
------
{'CEO': 'Darren W. Woods',
 'CEO Title': 'Chairman & Chief Executive Officer',
 'Employees': '71,200',
 'HQ Location': 'Irving, Texas',
 'Industry': 'Petroleum Refining',
 'Sector': 'Energy',
 'Website': 'www.exxonmobil.com',
 'Years on Fortune 500 List': '24'}
------
{'CEO': 'Warren E. Buffett',
 'CEO Title': 'Chairman, President & Chief Executive Officer',
 'Employees': '377,000',
 'HQ Location': 'Omaha',
 'Industry': 'Insurance: Property and Casualty (Stock)',
 'Sector': 'Financials',
 'Website': 'www.berkshirehathaway.com',
 'Years on Fortune 500 List': '24'}
------
{'CEO': 'Timothy D. Cook',
 'CEO Title': 'Chairman & Chief Executive Officer',
 'Employees': '123,000',
 'HQ Location': 'Cu

TimeoutException: Message: timeout
  (Session info: chrome=68.0.3440.106)
  (Driver info: chromedriver=2.41.578706 (5f725d1b4f0a4acbf5259df887244095596231db),platform=Mac OS X 10.13.6 x86_64)
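
The run above ends in a `TimeoutException`, and the truncated output does not show which call raised it. As a hedged sketch of one defensive pattern for the lazy-loading loop in `get_links`, the helper below stops scrolling when no new items arrive in time instead of raising. `load_until`, `FakePage`, and all parameter names are hypothetical. The fake page stands in for a real browser so the logic can be tried without Selenium.

```python
import time


def load_until(count_items, scroll, target, timeout=10.0, poll=0.5):
    """Scroll until `target` items are loaded or nothing new arrives in time.

    Returns the number of items loaded when the loop stops.
    """
    loaded = count_items()
    while loaded < target:
        scroll()
        deadline = time.monotonic() + timeout
        # Poll until the item count changes or the deadline passes
        while count_items() == loaded and time.monotonic() < deadline:
            time.sleep(poll)
        current = count_items()
        if current == loaded:
            break  # nothing new arrived in time: stop instead of raising
        loaded = current
    return loaded


class FakePage:
    """Stand-in for a lazy-loading page: each scroll reveals one more batch."""

    def __init__(self, total, batch):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)

    def count(self):
        return self.loaded


# The page has only 50 items, so asking for 100 ends quietly at 50
page = FakePage(total=50, batch=20)
loaded_all = load_until(page.count, page.scroll, target=100, timeout=0.05, poll=0.01)

# Asking for 30 stops after the first scroll that reaches the target
page = FakePage(total=50, batch=20)
loaded_some = load_until(page.count, page.scroll, target=30, timeout=0.05, poll=0.01)

print(loaded_all, loaded_some)
```

With a real driver, `scroll` could wrap `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")` and `count_items` could return the number of elements matching `ul.company-list > li`, assuming the page structure still matches the scraper above.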
