
# 2.1 Domains

## 2.1.1 Target

The web scraper targets Indeed Australia at au.indeed.com. Because most employers cross-post to several job boards, only one target was chosen to avoid duplication. Indeed was chosen over its industry competitors, Seek and Jora, for two primary reasons. First, preliminary investigations found that Indeed returns up to 915 advertisements for a single search, whereas Seek and Jora return only 500 and 600 respectively. Second, Indeed has standardized fields that oblige advertisers to provide sensible values. The best example is salary: Indeed requires advertisers to declare it as either a fixed amount or a range, accompanied by a pay frequency (an hour, a day, a year). Seek and Jora do not enforce this kind of standardization, so values such as 'Competitive Salary' are accepted in the Salary field.
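
To illustrate why the standardized salary field matters downstream, the following is a minimal sketch of turning such a value into numbers. The exact string shape ("$A" or "$A - $B" followed by "an hour", "a day" or "a year") is an assumption based on the description above, and `parse_salary` is a hypothetical helper, not part of this work's pipeline.

```python
import re
from typing import Optional, Tuple

# Assumed shape: "$A" or "$A - $B", then a pay frequency such as "an hour" or "a year".
SALARY_RE = re.compile(
    r"\$(?P<low>[\d,]+(?:\.\d+)?)"                # fixed amount, or lower bound of a range
    r"(?:\s*-\s*\$(?P<high>[\d,]+(?:\.\d+)?))?"   # optional upper bound
    r"\s+an?\s+(?P<period>\w+)"                   # pay frequency
)

def parse_salary(text: str) -> Optional[Tuple[float, float, str]]:
    """Return (low, high, period), or None if the text does not fit the assumed shape."""
    match = SALARY_RE.search(text)
    if match is None:
        return None
    low = float(match.group("low").replace(",", ""))
    high_raw = match.group("high")
    # A fixed amount is treated as a range whose bounds coincide.
    high = float(high_raw.replace(",", "")) if high_raw else low
    return low, high, match.group("period")

print(parse_salary("$80,000 - $100,000 a year"))
print(parse_salary("$45 an hour"))
print(parse_salary("Competitive Salary"))
```

A free-text value such as 'Competitive Salary' yields `None` here, which is the kind of record that cannot serve as a target value.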

## 2.1.2 Domain Limitations

The primary limitation of the Indeed domain is the low proportion of job postings with an attached salary. Whilst the non-salaried data has applications in bias assessment, a large share of the collected data will not benefit the machine learning pipeline. Furthermore, in the contemporary environment of professional social networking (LinkedIn etc.), not all jobs are listed on advertising boards, so whilst the web scraper should provide a sound picture of the employment market for Data Scientists, there may be some gaps in the overall picture.

## 2.1.3 Data Alignment

The data to be extracted, aligned to our investigation, is as follows:
1.   Salary. This is intended to be the target variable for the machine learning component.
2.   Location data (State, City). This is intended to capture the variability represented by the location in which people work. It is expected that economic hubs such as Melbourne and Sydney attract higher salaries due to the associated cost of living.
3.   Text data (Job Title, Company, Description). This data will form the bulk of the feature set following the data wrangling and feature extraction. It is expected that most of the signal in the dataset will come from these features, particularly those extracted from the Description field, which is expected to contain a rich corpus.

## 2.1.4 Copyright and Legal Considerations

Several lines of communication were opened with Indeed to seek express approval to conduct this work. Unfortunately, departments were reluctant to address the request and, in most cases, deferred the query to another department. A firm resolution on the matter could not be achieved in the time frame for this work. A sizeable body of work on data collection from Indeed exists among non-academic data science communities, which suggests that Indeed does not actively suppress projects that are not for commercial gain. This sentiment is in line with the Indeed Terms of Use.

# 2.2 Webcrawler Workflow

## 2.2.1 Technology Components

The technology used for the web scraper is Selenium (more specifically selenium.webdriver), whose Python bindings are documented by Baiju Muthukadan (2011). It is primarily used for automating web applications, and in this work that ability is harnessed by pairing Selenium with a Google Chrome driver. This combination provides the capability to complete tasks as a human would, but with much greater speed due to automation. The tasks Selenium performs in this work are:

*   Clicking buttons
*   Passing information to input fields
*   Searching for HTML elements
*   Scraping these elements into a database







In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

import pandas as pd
import numpy as np

## 2.2.2-3 Domain Complexity and Methodologies

The workflow contains three major processes, each with its own domain complexity. The first is to generate a search for Data Science postings. A consistent aspect of the Indeed HTML architecture is the provision of unique class names, which lowers the domain complexity and makes the elements required for the search easy to locate.
The beauty of Selenium is that it can be used to mirror human workflows, and many of its functions are based on primitive browser actions. To generate the search, two actions must be performed.

1.   Pass the search term to the keywords field
2.   Click the search button

To complete each of these steps, the element must first be located and the action then performed on it. Selenium provides a range of locator strategies for use with find_element(); the predominant one in this work is By.XPATH, where the element type and one of its attributes are passed to find matching elements in the HTML. Here, XPath arguments are constructed to find the keywords field and the search button, and send_keys() and click() are then called on the located elements to complete the two actions.
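
The way an XPath locator combines an element type with a class can be sketched without a browser. The example below uses the standard library's ElementTree, which supports only a subset of XPath, on a small piece of hypothetical markup; `xpath_by_class` is an illustrative helper and the markup is not taken from Indeed.

```python
import xml.etree.ElementTree as ET

def xpath_by_class(tag: str, class_value: str) -> str:
    """Build an XPath matching <tag> elements whose class attribute equals class_value exactly."""
    return f"//{tag}[@class='{class_value}']"

# Hypothetical, simplified markup standing in for a results page.
page = ET.fromstring(
    "<html><body>"
    "<a class='jcs-JobTitle css-jspxzf eu4oa1w0' href='/rc/clk?jk=1'>Data Scientist</a>"
    "<a class='jcs-JobTitle' href='/rc/clk?jk=2'>Only part of the class string</a>"
    "<a class='nav' href='/account/login'>Sign in</a>"
    "</body></html>"
)

locator = xpath_by_class("a", "jcs-JobTitle css-jspxzf eu4oa1w0")
# ElementTree requires a relative path, so the locator is prefixed with ".".
matches = page.findall("." + locator)
hrefs = [a.get("href") for a in matches]
print(locator)
print(hrefs)
```

Because `@class='...'` compares the whole attribute string, only the first anchor matches. The same property means a locator of this form stops matching if any part of the class string changes on the site.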



The code blocks below start a Chrome driver, open the Indeed Australia home page, pass the search term to the keywords field, and click the 'Find jobs' button.

In [None]:
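DRIVER_PATH = r'C:\Program Files (x86)\chromedriver.exe'  # raw string keeps the backslashes literal
# Recent Selenium 4 releases expect webdriver.Chrome(service=Service(DRIVER_PATH)),
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(DRIVER_PATH)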

In [None]:
base_url = 'http://au.indeed.com'
driver.get(base_url)

In [None]:
keywords_field_arg = '//input[@class="icl-TextInput-control icl-TextInput-control--withIconRight"]'
keywords_field = driver.find_element(By.XPATH, keywords_field_arg)
keywords_field.send_keys("Data Science")

initial_search_button = driver.find_element(By.XPATH, "//button[normalize-space()='Find jobs']")
initial_search_button.click()

The second process is to collate the URLs of the job listings for later querying. Much like the first process, the domain complexity is mitigated by the fact that each job listing URL is assigned the same class.

The code blocks below use the find_elements() function from selenium.webdriver, which returns every element matching the search. Instead of actioning any of these elements, the intention is to collect the URL held in each one by calling get_attribute('href'). A helper function drives the web scraper to the next page once the current page has been exhausted; it uses By.CSS_SELECTOR as the locator method, which takes CSS selector syntax rather than XPath but accepts a similar attribute-based argument, here the aria-label of the 'Next' link. To locate the URLs, By.XPATH is employed: the element type ('a') and the element class ('jcs-JobTitle css-jspxzf eu4oa1w0') are passed to the function, and the matching elements are returned. Using a list comprehension, their URLs are added to the master list for later querying. This method does pick up the link for the sign in button on each of the 60 pages; however, these are filtered out in the subsequent code block. This method was evaluated to be optimal as it makes only one data collection call to the HTML per page.

In [None]:
def next_page():
    next_page_url = driver.find_element(By.CSS_SELECTOR, "a[aria-label='Next']").get_attribute('href')
    driver.get(next_page_url)

In [None]:
job_urls = []
for page in range(60):
    elems = driver.find_elements(By.XPATH, "//a[@class='jcs-JobTitle css-jspxzf eu4oa1w0']")
    links = [elem.get_attribute('href') for elem in elems]
    for link in links:
        job_urls.append(link)
    next_page()
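
The page-by-page collection can also be sketched independently of the browser, which makes it easy to see how the loop might stop early when a search returns fewer pages than expected. This is a sketch under assumptions: `collect_links` and `FakeBrowser` are hypothetical names, and the fake pages stand in for the Selenium calls used above.

```python
from typing import Callable, List, Optional

def collect_links(
    get_page_links: Callable[[], List[Optional[str]]],
    go_to_next_page: Callable[[], bool],
    max_pages: int,
) -> List[str]:
    """Gather links page by page, stopping early when there is no next page."""
    collected: List[str] = []
    for _ in range(max_pages):
        # Anchors without an href come back as None and are skipped.
        collected.extend(link for link in get_page_links() if link)
        if not go_to_next_page():
            break
    return collected

class FakeBrowser:
    """Stand-in for the driver: a fixed list of hypothetical result pages."""
    def __init__(self, pages: List[List[Optional[str]]]) -> None:
        self.pages = pages
        self.index = 0

    def links(self) -> List[Optional[str]]:
        return self.pages[self.index]

    def next_page(self) -> bool:
        if self.index + 1 >= len(self.pages):
            return False  # the last page has no 'Next' link
        self.index += 1
        return True

browser = FakeBrowser([["u1", "u2"], ["u3", None], ["u4"]])
result = collect_links(browser.links, browser.next_page, max_pages=60)
print(result)
```

With a real driver, the two callables would wrap the find_elements() and next_page() calls shown above, with the latter returning False when the 'Next' link cannot be found.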

In [None]:
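# Keep only job listing links. A new list is built because removing items
# from a list while iterating over it causes elements to be skipped.
job_urls = [url for url in job_urls if url and url.startswith('https://au.indeed.com/rc/cl')]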

len(job_urls)

826
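
Filtering by building a new list avoids a pitfall of calling remove() during iteration, where the element that follows each removed item is never inspected. The sketch below contrasts the two approaches; the URLs are hypothetical examples chosen only to show the effect.

```python
PREFIX = "https://au.indeed.com/rc/cl"

# Hypothetical URLs: two consecutive non-listing links followed by a listing.
urls = [
    "https://au.indeed.com/account/login",
    "https://au.indeed.com/account/register",
    "https://au.indeed.com/rc/clk?jk=abc",
]

# Removing while iterating: after the first removal the list shifts left,
# so the loop never inspects the second non-listing link.
mutated = list(urls)
for url in mutated:
    if not url.startswith(PREFIX):
        mutated.remove(url)

# Building a new list inspects every element exactly once.
filtered = [url for url in urls if url.startswith(PREFIX)]

print(len(mutated), len(filtered))
```

Here the in-place version keeps one unwanted link, while the list comprehension keeps only the listing.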

In [None]:
def drop_https(url):
    return url[0:4]+url[5:] #return all but the fifth character
drop_https("https://www.afl.com.au")

'http://www.afl.com.au'
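
The character-slicing approach assumes every URL begins with "https". A possible alternative, sketched here with the standard library's urllib.parse, replaces the scheme explicitly and leaves URLs that are already http unchanged. `force_http` is a hypothetical name and not a function used elsewhere in this work.

```python
from urllib.parse import urlsplit

def force_http(url: str) -> str:
    """Return url with its scheme set to http, leaving the rest untouched."""
    return urlsplit(url)._replace(scheme="http").geturl()

print(force_http("https://www.afl.com.au"))
print(force_http("http://www.afl.com.au"))
print(force_http("https://au.indeed.com/rc/clk?jk=abc"))
```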

The third process is the individual querying of the HTML in each URL obtained in the previous step. This process is a better case to explain the complexity of the domain, which was found to be very approachable due to the HTML coding practices in place. The concept that makes Indeed relatively easy to scrape is the consistent provision of classes to elements in the code. 

The only challenge related to domain complexity was encountered in this step: it was initially difficult to find a way to separate the company and city data, as the class was the same for both containers. Eventually, the class of the parent container that encapsulates both pieces of data was targeted instead, and the two values can easily be separated in the data wrangling component of this work.
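
As a rough illustration of that later separation, the sketch below splits a combined value into company and location. It assumes that the innerText of the parent container places each value on its own line, which has not been verified here; `split_company_city` and the sample string are hypothetical.

```python
from typing import Optional, Tuple

def split_company_city(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Split combined header text into (company, location), assuming one value per line."""
    parts = [line.strip() for line in raw.splitlines() if line.strip()]
    if not parts:
        return None, None
    if len(parts) == 1:
        return parts[0], None
    # Assumption: the first non-empty line is the company and the last is the location.
    return parts[0], parts[-1]

print(split_company_city("Acme Analytics\n\nMelbourne VIC"))
print(split_company_city("Acme Analytics"))
```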

In [None]:
job_title_arg = "//h1[@class='icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title']"
company_city_arg = "//div[@class='icl-u-xs-mt--xs icl-u-textColor--secondary jobsearch-JobInfoHeader-subtitle jobsearch-DesktopStickyContainer-subtitle']"
salary_arg = "//span[@class='icl-u-xs-mr--xs']"
description_arg = "//div[@class='jobsearch-jobDescriptionText']"

In [None]:
unstructured_data = []
for url in job_urls:
    record = []  # start a blank record
    driver.get(drop_https(url))
    # Each lookup falls back to NaN when it fails. Catching Exception rather than
    # using a bare except still allows the loop to be interrupted from the keyboard.
    try:  # to get the company and city
        company_city = driver.find_element(By.XPATH, company_city_arg).get_attribute('innerText')
        record.append(company_city)
    except Exception:
        record.append(np.nan)
    try:  # to get the job title
        job_title = driver.find_element(By.XPATH, job_title_arg).get_attribute('innerText')
        record.append(job_title)
    except Exception:
        record.append(np.nan)
    try:  # to get salary
        salary = driver.find_element(By.XPATH, salary_arg).get_attribute('innerText')
        record.append(salary)
    except Exception:
        record.append(np.nan)
    try:  # to get description
        description = driver.find_element(By.XPATH, description_arg).get_attribute('innerText')
        record.append(description)
    except Exception:
        record.append(np.nan)

    unstructured_data.append(record)

str_data = pd.DataFrame(unstructured_data, columns=['Company_City', 'Job Title', 'Salary', 'Description'])
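
The four lookups in the loop follow the same pattern, so they could be expressed once. The sketch below shows that pattern with a stand-in for the page so that it runs without a browser; `extract_record`, `fake_get_text` and the sample locators are hypothetical, and `math.nan` is used in place of `np.nan` only to keep the example free of dependencies.

```python
import math
from typing import Callable, Dict, List

def extract_record(get_text: Callable[[str], str], locators: Dict[str, str]) -> List[object]:
    """Look up each locator in order, substituting NaN when the lookup fails."""
    record: List[object] = []
    for xpath in locators.values():
        try:
            record.append(get_text(xpath))
        except Exception:  # a missing element becomes NaN
            record.append(math.nan)
    return record

# Stand-in for a loaded page: only two of the three locators resolve.
fake_page = {"//h1": "Data Scientist", "//div[@id='desc']": "Build models."}

def fake_get_text(xpath: str) -> str:
    return fake_page[xpath]  # raises KeyError when the element is "missing"

locators = {"Job Title": "//h1", "Salary": "//span", "Description": "//div[@id='desc']"}
record = extract_record(fake_get_text, locators)
print(record)
```

With Selenium, `get_text` could be a small function wrapping the same `driver.find_element(By.XPATH, xpath).get_attribute('innerText')` call used in the loop above.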
   

## 2.2.4 Data Storage

As demonstrated in the above code block, a list of lists was created as a simple intermediate store for the data, and this was then converted to a pandas data frame.

# Output

In [None]:
str_data.to_excel("output.xlsx")  

Checkpointing served as an essential procedure in this work given the large scope of the project. Converting the data to an Excel file provided a robust, offline copy of the generated data, both to protect the progress made and to pass the data into a different workbook for the data processing component.
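
Checkpointing can also be done incrementally during a long scrape, so that an interruption loses at most one batch. The following is a sketch of that idea only: it swaps the Excel output for a CSV file written with the standard library, and `append_checkpoint`, the file name and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile
from typing import Iterable, List

def append_checkpoint(path: str, header: List[str], rows: Iterable[List[str]]) -> None:
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)

header = ["Company_City", "Job Title", "Salary", "Description"]
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")

# Two hypothetical batches, saved as they are scraped.
append_checkpoint(checkpoint_path, header, [["Acme\nSydney NSW", "Data Scientist", "$100,000 a year", "Build models."]])
append_checkpoint(checkpoint_path, header, [["Beta\nPerth WA", "Analyst", "", "Report on data."]])

with open(checkpoint_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.reader(handle))
print(len(saved))
```

The csv module quotes fields that contain line breaks, so multi-line values such as the combined company and city text survive the round trip.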