## Requirements
* [ChromeDriver](https://chromedriver.chromium.org/downloads)
* Selenium

### Thought Works
* Consider using webdriver-manager to streamline installation of ChromeDriver

### Proved our Proof of Concept
Now that we can successfully scrape information from the website, we want to speed up our implementation. At the moment, the main bottleneck is:
* We have to wait until the webdriver is fully initialized before we start making requests to the page, and the sleep timer I've added slows the implementation down. I believe that once the webdriver is initialized, we only have to wait for the webpage to load once per company. Since there are 531 companies and each company page can take about 3 seconds to load (on our current network connection), we'd wait roughly 27 minutes (531 × 3 s = 1,593 s) to acquire our information. This implementation is not robust because it assumes a stable network connection. A more robust implementation would wait until the page has fully loaded; this should be implemented in a future iteration.
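
As a rough check of the estimate above, the total wait can be computed directly. The 3-second figure is the per-page load time assumed in the text, not a measurement.

```python
# Back-of-the-envelope estimate of the total page-load wait time.
NUM_COMPANIES = 531      # companies listed on the Culture 500 site
SECONDS_PER_PAGE = 3     # assumed load time per company page

total_seconds = NUM_COMPANIES * SECONDS_PER_PAGE
total_minutes = total_seconds / 60

print(f"{total_seconds} s is about {total_minutes:.2f} min")
```

Any fixed sleep added per page would increase this total proportionally.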

In [1]:
import bs4
import math
import multiprocessing as mp
import pandas as pd
import requests
import selenium
import time

from bs4 import BeautifulSoup
from collections import namedtuple
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

### Configuration
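The cell below defines the constants used throughout the scraper: the base URL, the company values to collect, the output column names, the ChromeDriver path and options, and the range of company indices.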

In [2]:
BASE_COMPANY_URL = "https://sloanreview.mit.edu/culture500/company/c"
COMPANY_VALUES = {
    "Agility": "agility",
    "Collaboration": "collaboration",
    "Customer": "customer",
    "Diversity": "inclusivity",
    "Execution": "execution",
    "Innovation": "innovation",
    "Integrity": "integrity",
    "Performance": "performance",
    "Respect": "respect",
}
HEADERS = [
    "frequency_score",
    "sentiment_score",
    "value",
    "company"
]
PATH_TO_CHROME_DRIVER = "/Users/jamosa/bin/chromedriver"
CHROME_OPTIONS = Options()
CHROME_OPTIONS.headless = True
CHROME_OPTIONS.add_argument("--window-size=1920,1080")  # Chrome expects width,height
COMPANY_START = 101
COMPANY_END = 631
CAPABILITIES = DesiredCapabilities.CHROME.copy()
CAPABILITIES['acceptSslCerts'] = True 
CAPABILITIES['acceptInsecureCerts'] = True

### Fsvc
Fsvc is a named tuple that makes it easier to work with the information we've scraped and stored. It lets me access fields by descriptive names instead of seemingly arbitrary numerical indices.

In [3]:
# Named tuple for holding frequency score, sentiment score, company value, and company name
Fsvc = namedtuple("Fsvc", " ".join(HEADERS))
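
A short illustration of the access pattern described above. The record values are placeholders made up for this example, not scraped data.

```python
from collections import namedtuple

HEADERS = ["frequency_score", "sentiment_score", "value", "company"]
Fsvc = namedtuple("Fsvc", " ".join(HEADERS))

# Placeholder values for illustration only
record = Fsvc(
    frequency_score="1.0",
    sentiment_score="2.0",
    value="Agility",
    company="Example Co",
)

by_name = record.company     # descriptive field access
by_index = record[3]         # equivalent positional access
as_dict = record._asdict()   # handy when building a DataFrame row

print(by_name, by_index, list(as_dict))
```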

### partition
Partition mimics a mathematical partition. Given a whole number, a desired number of slices, and an optional offset, it generates a list of approximately evenly spaced boundary values.

In [4]:
def partition(value, num_slices, offset=0):
    """Creates an approximately even partition of some whole number."""
    hop = value // num_slices
    indices = [math.floor(ndx * hop) + offset for ndx in range(num_slices)]
    indices.append(value + offset)
    
    return indices

partition(531, 2, 101)

[101, 366, 632]
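
The notebook does not show how these boundaries are consumed. One plausible reading, sketched below, is that consecutive boundaries form half-open index ranges that could each be handed to a separate worker. The helper `to_ranges` is hypothetical and not part of the original code.

```python
def partition(value, num_slices, offset=0):
    """Creates an approximately even partition of some whole number."""
    hop = value // num_slices
    indices = [ndx * hop + offset for ndx in range(num_slices)]
    indices.append(value + offset)
    return indices


def to_ranges(boundaries):
    """Pair consecutive boundaries into half-open ranges [start, stop)."""
    return [range(start, stop) for start, stop in zip(boundaries, boundaries[1:])]


chunks = to_ranges(partition(531, 2, 101))

# Flatten the chunks to confirm every company index is covered exactly once
covered = [ndx for chunk in chunks for ndx in chunk]
print(chunks, len(covered))
```

Because the final boundary (632) is treated as exclusive, the last index visited is 631, which matches the inclusive range of company pages.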

### parse_page_by_tag_and_classes
This function may not look intuitive at first, but it exists to reduce the amount of code I write. I found myself typing the same logic multiple times, so to stay DRY (Don't Repeat Yourself) I extracted the core logic of retrieving text from a tag with specific classes into a function.

In [5]:
def parse_page_by_tag_and_classes(
    scraper : bs4.BeautifulSoup,
    tag : str,
    classes : str
    ) -> str:
    """Returns the text associated with the first instance of `tag` with classes `classes`
    
    :param scraper: an instance of bs4 web scraper already loaded with the document's html
    :param tag: the tag element to search for without the brackets
    :param classes: the classes of interest associated with the tag separated by spaces
    :return: a string of text
    """
    tag_elem = scraper.find(tag, {"class": classes})
    return tag_elem.get_text()

## Company URL
The URLs for the companies are nearly identical and differ only in a trailing index. The offset for a company's index is 100, so the first company's information, for `3M`, can be found at `https://sloanreview.mit.edu/culture500/company/c101`. Therefore, information for every company listed on the Culture 500 page can be found by setting the last 3 characters of the URL to a value in the range 101 to 631, inclusive.
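
The URL scheme described above can be sketched as a list comprehension over the index range. The constant names mirror the configuration cell.

```python
BASE_COMPANY_URL = "https://sloanreview.mit.edu/culture500/company/c"
COMPANY_START = 101
COMPANY_END = 631

# range() excludes its stop value, so add 1 to include the last company
company_urls = [
    f"{BASE_COMPANY_URL}{ndx}"
    for ndx in range(COMPANY_START, COMPANY_END + 1)
]

print(len(company_urls), company_urls[0], company_urls[-1])
```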

## Singleton Inspection
So, questions! Now we need to examine a single instance and see if we can successfully scrape information from the site.
* Can we successfully scrape information about a single instance?
    * Yes
* Do all instances follow the same structure?
    * Yes

### retrieve_company_values
Given a company's index and a webdriver object, this function gets all of the information from that company's Culture 500 page and stores it in a list of named tuples, which is returned to the caller. This function contains the majority of our logic and required a lot of experimentation to determine the underlying structure of the page.

In [6]:
def retrieve_company_values(
    company_ndx : int,
    browser : selenium.webdriver.chrome.webdriver.WebDriver
    ) -> list:
    """Retrieve information about a company's values.
    
    There are 531 companies listed at `https://sloanreview.mit.edu/culture500/research/#company-list`
    and each company has a numerical value associated with where it is located on their servers. The
    starting index is 101 and subsequent companies, in alphabetical order, follow by simply adding 1
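    to the previous index until 631 is reached.
    
    Example company url: `https://sloanreview.mit.edu/culture500/company/c101` --> 3M
    
    :param company_ndx: the company's numerical value as listed on the site
    :param browser: an initialized Chrome webdriver used to load the page
    :return: a list of Fsvc tuples, one per company value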
    """
    # Request web page
    company_info = []
    cur_value = ""
    company_url = f"https://sloanreview.mit.edu/culture500/company/c{company_ndx}"
    browser.get(company_url)
    
    # Needs to wait until graph is loaded - classes associated with graph ==> .sc-bdVaJa .sc-bwzfXH .pjOvZ
    delay = 3
    graph_elem = (
        WebDriverWait(browser, delay)
        .until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, ".sc-bdVaJa.sc-bwzfXH.pjOvZ")
        ))
    )
    
    # All buttons associated with company values share the same set of classes.
    # A compound class list is a CSS selector, so locate the buttons with By.CSS_SELECTOR.
    cv_buttons = browser.find_elements(
        By.CSS_SELECTOR,
        ".sc-bdVaJa.sc-EHOje.bubble-sidebar-button.sc-gqjmRU.eiaGtK"
    )
    time.sleep(2)
    for button in cv_buttons:
        # Get current value
        cur_value = button.text
        # Grab all links and simulate a click action
        button.click()
        # time.sleep(.3)
        # Grab the updated page's content
        updated_page_html = browser.page_source
        # Initialize web scraper with the current company's HTML page
        scraper = BeautifulSoup(updated_page_html, "html.parser")
        # Get frequency score and sentiment score
        text = parse_page_by_tag_and_classes(scraper, "div", "sc-bdVaJa sc-htpNat hALkol")
        text = text.replace("Sentiment Score: ", " ").replace("Frequency Score: ", "")
        freq_score, sent_score = text.split(" ")
        # Get company name
        company_name = parse_page_by_tag_and_classes(scraper, "div", "sc-bdVaJa sc-bwzfXH jruoDg")
        # Add information to container
        fsvc = Fsvc(freq_score, sent_score, cur_value, company_name)
        company_info.append(fsvc)
    
    return company_info
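
The replace-and-split chain in the function above implies that the score element's text looks like `Frequency Score: <x>Sentiment Score: <y>`. Under that assumption, a regular expression is one alternative that fails loudly when the text does not match. The helper `parse_scores` and the sample string are hypothetical and only illustrate the idea.

```python
import re

# Assumed text layout: "Frequency Score: <x>Sentiment Score: <y>"
SCORE_PATTERN = re.compile(
    r"Frequency Score:\s*(?P<freq>\S+?)\s*Sentiment Score:\s*(?P<sent>\S+)"
)


def parse_scores(text):
    """Return (frequency_score, sentiment_score) as strings."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unexpected score text: {text!r}")
    return match.group("freq"), match.group("sent")


# Made-up sample text, not taken from the site
sample = "Frequency Score: 1.5Sentiment Score: -0.25"
freq_score, sent_score = parse_scores(sample)
print(freq_score, sent_score)
```

Like the original chain, this keeps the scores as strings; converting them to numbers is left to the caller.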

## Next Steps - Gotta Catch 'Em All
Now that we can successfully scrape information about any one company, we introduce a loop to get information about all companies. This part also took some thinking outside of the box. We could seamlessly scrape information from Culture 500, but the browser window popping up for each company became bothersome and unnecessary. So, instead of having our browser (Chrome, in this case) pop up for each company, we modified our configuration so that Chrome runs in headless mode.

In [7]:
# Setup webdriver
chrome_driver = webdriver.Chrome(
    executable_path=PATH_TO_CHROME_DRIVER,
    options=CHROME_OPTIONS,
    desired_capabilities=CAPABILITIES
)

# Create dataframe to save information to
company_values_df = pd.DataFrame(columns=HEADERS)

# Get information about all companies
for ndx in range(COMPANY_START, COMPANY_END+1):
    list_of_values = retrieve_company_values(ndx, chrome_driver)
    cur_company_values_df = pd.DataFrame.from_records(list_of_values, columns=Fsvc._fields)
    company_values_df = pd.concat([company_values_df, cur_company_values_df], axis=0)
    print(f"Company Name: {list_of_values[0].company} -- Status: Information Retrieved...")
    
# At this point we've successfully scraped all of the information
# Let's close our connection and save our information to a csv file
chrome_driver.quit()
with open("culture500_data.csv", "a") as culture500_file:
    culture500_file.write(company_values_df.to_csv(index=False))

Company Name: 3M -- Status: Information Retrieved...
Company Name: Anheuser-Busch InBev -- Status: Information Retrieved...
Company Name: ABB -- Status: Information Retrieved...
Company Name: Abbott Laboratories -- Status: Information Retrieved...
Company Name: AbbVie -- Status: Information Retrieved...
Company Name: Abercrombie & Fitch -- Status: Information Retrieved...
Company Name: Accenture -- Status: Information Retrieved...
Company Name: Adobe -- Status: Information Retrieved...
Company Name: ADP -- Status: Information Retrieved...
Company Name: AdventHealth -- Status: Information Retrieved...
Company Name: Advisory Board -- Status: Information Retrieved...
Company Name: Advocate Health Care -- Status: Information Retrieved...
Company Name: Aecom -- Status: Information Retrieved...
Company Name: Aéropostale -- Status: Information Retrieved...
Company Name: Aetna -- Status: Information Retrieved...
Company Name: Aflac -- Status: Information Retrieved...
Company Name: AIG -- Statu

Company Name: Comcast -- Status: Information Retrieved...
Company Name: Comerica -- Status: Information Retrieved...
Company Name: CompuCom -- Status: Information Retrieved...
Company Name: Conagra Brands -- Status: Information Retrieved...
Company Name: Conduent -- Status: Information Retrieved...
Company Name: ConocoPhillips -- Status: Information Retrieved...
Company Name: Convergys -- Status: Information Retrieved...
Company Name: Costco Wholesale -- Status: Information Retrieved...
Company Name: Covidien -- Status: Information Retrieved...
Company Name: Cox Communications -- Status: Information Retrieved...
Company Name: Credit Suisse -- Status: Information Retrieved...
Company Name: CSAA Insurance Group -- Status: Information Retrieved...
Company Name: CSC -- Status: Information Retrieved...
Company Name: Cummins -- Status: Information Retrieved...
Company Name: CVS Health -- Status: Information Retrieved...
Company Name: Dairy Queen -- Status: Information Retrieved...
Company Na

Company Name: Infor -- Status: Information Retrieved...
Company Name: Infosys -- Status: Information Retrieved...
Company Name: Ingersoll Rand -- Status: Information Retrieved...
Company Name: Ingram Micro -- Status: Information Retrieved...
Company Name: Instacart -- Status: Information Retrieved...
Company Name: Intel Corporation -- Status: Information Retrieved...
Company Name: InterContinental Hotels Group -- Status: Information Retrieved...
Company Name: Intermountain Healthcare -- Status: Information Retrieved...
Company Name: Intuit -- Status: Information Retrieved...
Company Name: Illinois Tool Works -- Status: Information Retrieved...
Company Name: J. C. Penney -- Status: Information Retrieved...
Company Name: J.Crew -- Status: Information Retrieved...
Company Name: Johnson & Johnson -- Status: Information Retrieved...
Company Name: Jabil -- Status: Information Retrieved...
Company Name: Jack Henry & Associates -- Status: Information Retrieved...
Company Name: Jack in the Box 

TimeoutException: Message: 
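
The run above stopped partway through with a `TimeoutException`, which discards the progress made so far because the CSV is only written after the loop. One possible mitigation, sketched below, is to catch the timeout per company, retry once, and record the indices that still fail. The names `scrape_all` and `fake_fetch` are hypothetical. The sketch uses Python's built-in `TimeoutError` and a stand-in fetch function so it runs without a browser; in the notebook, `fetch` would wrap `retrieve_company_values` and `retry_on` would be `selenium.common.exceptions.TimeoutException`.

```python
def scrape_all(indices, fetch, retry_on=(TimeoutError,), max_attempts=2):
    """Call fetch(ndx) for each index, retrying on the given exceptions.

    Returns the collected rows and the indices that failed every attempt.
    """
    results, failed = [], []
    for ndx in indices:
        for attempt in range(1, max_attempts + 1):
            try:
                results.extend(fetch(ndx))
                break  # success, move on to the next index
            except retry_on:
                if attempt == max_attempts:
                    failed.append(ndx)
    return results, failed


def fake_fetch(ndx):
    """Stand-in for the real scraper: index 103 always times out."""
    if ndx == 103:
        raise TimeoutError("page did not load in time")
    return [("row", ndx)]


rows, failed = scrape_all(range(101, 106), fake_fetch)
print(len(rows), failed)
```

The failed indices can then be retried in a later run instead of restarting from company 101.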
