# `Selenium Webscraping Indeed Job Postings - July 2023`

# <font color=red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# `Purpose & Outcome:`

+ Webscrape Indeed Postings
+ Methods, drawbacks and suggestions
+ Speeding up code and downsides with this method!

# `What is Selenium and how is it used?`

+ Selenium is a browser automation tool. It helps with automated browser testing, with automating repetitive tasks, and with webscraping.
    + Great for clicking buttons
    + drop-down menus
    + acting/emulating human interactions on a webpage
  
+ `You can use Selenium as a webscraper, but it's not fast. It will help if you are in a pinch.`

In [1]:
# Install if you have never used these: uncomment the lines below to install if needed

# !pip install --upgrade pip
# !pip install beautifulsoup4 lxml pandas
# !pip install -U selenium
# !pip install webdriver-manager

In [2]:
# --------- import necessary modules -------

# For webscraping
from bs4 import BeautifulSoup

# Parsing and creating xml data
from lxml import etree as et

# Store data as a csv file written out
from csv import writer

# In general to use with timing our function calls to Indeed
import time

# Assist with creating incremental timing for our scraping to seem more human
from time import sleep

# Dataframe stuff
import pandas as pd

# Random integer for more realistic timing for clicks, buttons and searches during scraping
from random import randint

# Multi Threading
import threading

# Threading:
from concurrent.futures import ThreadPoolExecutor, wait

In [3]:
import selenium

# Check version I am running
selenium.__version__

'4.10.0'

In [653]:
# Selenium 4:

from selenium import webdriver

# Starting/Stopping Driver: can specify ports or location but not remote access
from selenium.webdriver.chrome.service import Service as ChromeService

# Manages Binaries needed for WebDriver without installing anything directly
from webdriver_manager.chrome import ChromeDriverManager

In [655]:
# Allows searches similar to beautiful soup: find_all
from selenium.webdriver.common.by import By

# Try to establish wait times for the page to load
from selenium.webdriver.support.ui import WebDriverWait

# Wait for specific condition based on defined task: web elements, boolean are examples
from selenium.webdriver.support import expected_conditions as EC

# Used for keyboard movements, up/down, left/right,delete, etc
from selenium.webdriver.common.keys import Keys

# Locate elements on page and throw error if they do not exist
from selenium.common.exceptions import NoSuchElementException

# `Consider Headless Browser: speed up & uses less resources`

There are some considerations though:

+ Some browsers create issues
+ debugging can be tricky
+ you may have limited plugin usage or support
+ you are not able to see visually how the website or application is working 

`-------------------------------------------------`

# `from selenium.webdriver.common.by import By`

Think of this as being similar to using `Beautiful Soup and find_all`
+ It tells Selenium how to locate something within an HTML document (by ID, class name, CSS selector, XPath, etc.). If `find_element` finds nothing, Selenium raises the exception: `NoSuchElementException`
+ **`Be careful when using BY`** because if this is not a static page then any attributes you are searching for can change, and your code will fail in the future.
    + For example, searching by `Class` can create issues later
        + Class names exist mainly for `CSS` styling, so they can change over time when the site is restyled
    + Searching by `ID` may make your code more robust! This CAN be a unique identifier that may help you instead

# `NoSuchElementException`

Selenium raises this when it cannot locate an element on the page. Catching it lets you handle a missing element instead of crashing.
+ During `AJAX` calls you may have issues if the application was built using `React, Vue, Angular`: elements can appear after the initial page load, so an immediate lookup fails and you need to wait or poll before checking. [article to explain](https://reflect.run/articles/everything-you-need-to-know-about-nosuchelementexception-in-selenium/)

`-------------------------------------------------`

# `Other Common Errors:`

+ **`InvalidSelectorException`**

+ **`ElementNotInteractableException`**

+ **`TimeoutException`**

In [6]:
# Allows you to customize: incognito mode, maximize window size, headless browser, disable certain features, etc
option= webdriver.ChromeOptions()

# Going undercover:
option.add_argument("--incognito")


# # Consider this once you know the scraper works: headless mode speeds things up and uses fewer resources!

# option.add_argument('--headless=new')


In [422]:
# Define job and location search keywords
job_search_keyword = ['Data+Scientist', 'Business+Analyst', 'Data+Engineer', 
                      'Python+Developer', 'Full+Stack+Developer', 
                      'Machine+Learning+Engineer']

# Define Locations of Interest
location_search_keyword = ['New+York', 'California', 'Washington']

# Finding location, position, radius=35 miles, sort by date and starting page
paginaton_url = 'https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date&start={}'

# print(paginaton_url)

https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date&start={}
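
The template above expects the keywords to be encoded by hand with `+`. As a minimal sketch, assuming the same query parameters as the template, the standard library can do the encoding for you. `build_search_url` is a hypothetical helper name, not something from Indeed or Selenium.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(job, location, start=0, radius=35):
    """Build a search URL with the same parameters as the template above."""
    params = {
        "q": job,            # plain text, spaces are encoded as '+'
        "l": location,
        "radius": radius,
        "filter": 0,
        "sort": "date",
        "start": start,
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("Data Engineer", "New York", start=10))
# https://www.indeed.com/jobs?q=Data+Engineer&l=New+York&radius=35&filter=0&sort=date&start=10
```

This way the keyword lists can hold readable strings such as `'Data Scientist'` instead of `'Data+Scientist'`.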


# `Things to consider when scraping data:`

+ Wait for page to load before we start running tasks
+ make sure what we are looking for is actually there
    + It can be absent
    + hidden in DOM, iframe or similar
+ timing our calls to remain more like an average user
+ Exception handling
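
The timing and exception handling points in the list above can be combined into one small helper. This is only a sketch under my own assumptions: `polite_call` is a hypothetical name, the 2 to 6 second window mirrors the `randint(2, 6)` sleeps used in this notebook, and the broad `except Exception` should be narrowed to the errors you actually expect.

```python
import random
import time

def polite_call(action, retries=3, min_wait=2.0, max_wait=6.0, sleep=time.sleep):
    """Call action(); on failure wait a random interval and try again."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception as err:  # narrow this to the errors you expect
            last_error = err
            if attempt < retries:
                # Random pause so repeated calls do not look machine-timed
                sleep(random.uniform(min_wait, max_wait))
    raise last_error
```

With a driver already open you could wrap a page load as `polite_call(lambda: driver.get(url))`. The `sleep` parameter exists only so the pause can be swapped out when testing.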

`----------------------------------------------`

# `I/O vs CPU Bound:`

**`During webscraping tasks you are I/O bound!`** Most of the time is spent waiting on calls that retrieve `HTML`, not on computation. Try to avoid unnecessary calls, which may get your IP address blocked like mine has been many times. [CPU, I/O article](https://testdriven.io/blog/concurrency-parallelism-asyncio/)

+ **`Multi-Threading:`** `concurrent`
    + In Python your threads will not run in parallel here; they take turns running. 
    + If one task is waiting or slow, another task can start working in the meantime
        + Meaning that tasks can finish out of order: not 1-2-3 but maybe 2-3-1, for example
        + This can occur due to Network or I/O operations
+ **`Multi-Processing:`** `parallel` tasks run at the same time in separate processes, which helps CPU-bound work more than I/O-bound work.

+ **`AsyncIO:`** gives the benefit of threading, but a single thread switches to another task whenever one is waiting, so more tasks run during wait times.
This is a step above threading but requires more code and thought to set up.

`----------------------------------------------`

+ **`Asynchronous:`** think of starting one task and then starting the next before the first has finished. It is like sending a request to one person, not getting an answer right away, and moving on to the next person in line; when the first person's response arrives and you are free, you go back and help them. Essentially, this lowers the idle time spent waiting for a response. [Good breakdown and visuals](https://medium.com/analytics-vidhya/asynchronous-web-scraping-101-fetching-multiple-urls-using-arsenic-ec2c2404ecb4)
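
To make the idea of lowering idle time concrete, here is a small sketch with `asyncio`. No real pages are fetched: `fake_request` is a hypothetical stand-in that just sleeps, and the delays are made up for illustration.

```python
import asyncio
import time

async def fake_request(name, delay):
    """Stand-in for a network call: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # All three "requests" wait at the same time instead of one after another.
    return await asyncio.gather(
        fake_request("page-1", 0.3),
        fake_request("page-2", 0.2),
        fake_request("page-3", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)                      # results come back in the order they were requested
print(f"{elapsed:.2f} seconds")     # close to the longest delay, not the sum of all three
```

Inside a Jupyter notebook an event loop is already running, so there you would write `results = await main()` instead of `asyncio.run(main())`.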

# `Let's look at what is going on below:`

There are some concerns and things for you to consider:

1.) The job count read below is the MAX number of jobs you will find for a search of interest!

2.) This is not an accurate depiction because you can end up with far fewer than this, depending on the results
    
    + An issue arises due to duplicated listings

3.) Pagination is difficult, and so is knowing when to stop iterating over the search results

4.) You have a filter option (&filter=0, &filter=1). filter=1 shows non-duplicates, which reduces results, but you still need to figure out how to do pagination!

`---------------------------------------------------------------`

# `First let's try to find number of jobs for a given posting`


In [650]:
start = time.time()


job_='Data+Engineer'
location='Washington'

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option)


driver.get(paginaton_url.format(job_,location,0))

# t = ScrapeThread(url_)
# t.start()

sleep(randint(2, 6))

p=driver.find_element(By.CLASS_NAME,'jobsearch-JobCountAndSortPane-jobCount').text

# Max number of pages for this search! There is a caveat described soon
# Strip thousands separators (e.g. '1,234 jobs') before converting to int
max_iter_pgs=int(p.split(' ')[0].replace(',', ''))//15 


driver.quit() # Closing the browser we opened


end = time.time()

print(end - start,'seconds to complete action!')
print('-----------------------')
print('Max Iterable Pages for this search:',max_iter_pgs)


7.784760236740112 seconds to complete action!
-----------------------
Max Iterable Pages for this search: 11
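
The page count above comes from splitting the job-count text and dividing by 15. A slightly more defensive sketch is below. `pages_for` is a hypothetical helper, the 15 results per page is carried over from the cell above as an assumption, and it rounds up rather than down so a partial last page is still counted (the cell above rounds down).

```python
import math
import re

def pages_for(count_text, per_page=15):
    """Turn text such as '1,234 jobs' into a number of result pages."""
    match = re.search(r"\d[\d,]*", count_text)
    if match is None:
        return 0                                  # no number found in the text
    total = int(match.group().replace(",", ""))   # drop thousands separators
    return math.ceil(total / per_page)

print(pages_for("1,234 jobs"))   # 83
print(pages_for("15 jobs"))      # 1
print(pages_for("no results"))   # 0
```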


 
`----------------------------------------------------------` 
 
# Notes for this project:

+ Filling in forms:
+ click buttons
+ possible human detection stuff

**`Xpath vs CSS selectors for retrieving data`**

+ `Xpath:` bidirectional traversal (can go from parent to child and the reverse)
    + generally slower retrieval speed
    + text functions supported
    + pay attention to relative '//' and absolute path '/' notations
    + Think of a tree-like structure that you break down step by step
+ `CSS:` one-directional (parent to child only)

`------------------------`

**`Xpath`**
+ *`Xpath`* stands for `XML Path Language`, a query language used to find the path of an element in XML documents
+ Essentially you are navigating a `DOM` 
+ More flexible than using `CSS`
    + If you only know part of an attribute value or text you can use `contains` as your keyword, which is great!
 
**`CSS`**
+ Most often the HTML is styled with Cascading Style Sheets, and elements are identified by the `Class` they belong to
+ They are used to select various elements within a `DOM`
    + **`Simple selectors:`** such as finding a `Class` or `ID`
    + **`Attribute selectors:`** such as matching on an attribute like `href`
    + **`Pseudo selectors:`** such as hover boxes or check boxes as examples
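
Here is a tiny sketch of the "bidirectional" point, using only the standard library so it runs without a browser. The HTML snippet is made up (the class names are borrowed from the scraping code in this notebook), and `xml.etree.ElementTree` supports only a subset of XPath, so there is no `contains` here; `lxml` or Selenium's `By.XPATH` would be needed for that.

```python
import xml.etree.ElementTree as ET

snippet = """
<div id="mosaic-jobResults">
  <div class="job_seen_beacon">
    <h2 class="jobTitle"><a id="job_1" href="/viewjob?jk=1">Data Engineer</a></h2>
    <span class="companyName">Acme</span>
  </div>
</div>
"""

root = ET.fromstring(snippet)

# Relative search: look anywhere below the root for the link with this id.
link = root.find(".//a[@id='job_1']")

# Walk back UP to the parent with '..', which a CSS selector cannot express.
heading = root.find(".//a[@id='job_1']/..")

print(link.text)              # Data Engineer
print(heading.get("class"))   # jobTitle
```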
    
# `Wait times:`

Because of how webpages are rendered, various items can load at different times. 
This can be a problem when you are webscraping. If you try to grab the elements too fast you can miss something or 
cause errors that could have been avoided. 

Ways to combat this include explicit waits within Selenium; see the [selenium doc](https://selenium-python.readthedocs.io/waits.html) 

`from selenium.webdriver.support.wait import WebDriverWait`

`from selenium.webdriver.support import expected_conditions as EC`
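
Conceptually, an explicit wait keeps checking a condition until it is true or a timeout passes. The sketch below shows that idea in plain Python so it runs without a browser; `wait_until` is a hypothetical helper and not Selenium's own implementation. With the imports above, the Selenium version would look like `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "jobDescriptionText")))`, which raises `TimeoutException` if the element never shows up.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Pretend the page needs three checks before it is "ready".
attempts = []

def page_ready():
    attempts.append(1)
    return len(attempts) >= 3

wait_until(page_ready, timeout=2.0, poll=0.01)
print(len(attempts))   # 3
```

Compared with a fixed `sleep(randint(2, 6))`, this returns as soon as the condition holds, so fast pages do not pay the full delay.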

`----------------------------------------------------------------------`

In [651]:
# Pagination: PRACTICE

start = time.time()


job_='Data+Engineer'
location='Washington'


job_lst=[]
job_description_list_href=[]

# job_description_list = []
salary_list=[]


driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option)
sleep(randint(2, 6))

# driver.get("https://www.indeed.com/q-USA-jobs.html")

for i in range(0,max_iter_pgs):
    driver.get(paginaton_url.format(job_,location,i*10))
    
    
    sleep(randint(2, 4))

    job_page = driver.find_element(By.ID,"mosaic-jobResults")
    jobs = job_page.find_elements(By.CLASS_NAME,"job_seen_beacon") # return a list

    for jj in jobs:
        job_title = jj.find_element(By.CLASS_NAME,"jobTitle")
#         print(job_title.text)
        
# Fields collected for each job card, in order:
# title, href to the full job description (needs a second pass to get the full text),
# reference ID used by Indeed, company name, location, posting date, href (repeated)

        job_lst.append([job_title.text,
        job_title.find_element(By.CSS_SELECTOR,"a").get_attribute("href"),
        job_title.find_element(By.CSS_SELECTOR,"a").get_attribute("id"),      
        jj.find_element(By.CLASS_NAME,"companyName").text,       
        jj.find_element(By.CLASS_NAME,"companyLocation").text,
        jj.find_element(By.CLASS_NAME,"date").text,
        job_title.find_element(By.CSS_SELECTOR,"a").get_attribute("href")])
        

        try: # I removed the metadata attached to this class name to work!
            salary_list.append(jj.find_element(By.CLASS_NAME,"salary-snippet-container").text)

        except NoSuchElementException: 
            try: 
                salary_list.append(jj.find_element(By.CLASS_NAME,"estimated-salary").text)
                
            except NoSuchElementException:
                salary_list.append(None)
      
                
#         # Click the job element to get the description
#         job_title.click()
        
#         # Help to load page so we can find and extract data
#         sleep(randint(3, 5))

#         try: 
#             job_description_list.append(driver.find_element(By.ID,"jobDescriptionText").text)
            
#         except: 
            
#             job_description_list.append(None)

driver.quit() 


end = time.time()

print(end - start,'seconds to complete Query!')

# alternate way to grab the info for job description to make it faster:


75.44914388656616 seconds to complete Query!


In [660]:
job_lst[0:2]

[['Data Engineer- W2 and onsite-(No C2C Candidates)',
  'https://www.indeed.com/company/Aalpha-Tech-Global/jobs/Data-Engineer-0a3b693beb4513a1?fccid=446617e52fa5726d&vjs=3',
  'job_0a3b693beb4513a1',
  'Aalpha Tech Global',
  'Seattle, WA 98101 \n(Downtown area)',
  'Posted\nJust posted',
  'https://www.indeed.com/company/Aalpha-Tech-Global/jobs/Data-Engineer-0a3b693beb4513a1?fccid=446617e52fa5726d&vjs=3'],
 ['Data and Analytics Engineer - Senior Associate',
  'https://www.indeed.com/rc/clk?jk=f649d5700cf3da4d&fccid=5e964c4afc56b180&vjs=3',
  'job_f649d5700cf3da4d',
  'PRICE WATERHOUSE COOPERS',
  'Seattle, WA 98101 \n(Downtown area)',
  'Posted\nToday',
  'https://www.indeed.com/rc/clk?jk=f649d5700cf3da4d&fccid=5e964c4afc56b180&vjs=3']]

In [658]:
salary_list[0:3]

['Estimated $101K - $128K a year',
 'Estimated $117K - $148K a year',
 'Estimated $114K - $145K a year']

# `Here is a side note:`


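+ This gives me an error because it is code for an older version. In the Selenium version I am running, the driver path has to be wrapped in a `Service` object, as in the cells above:

`driver = webdriver.Chrome(ChromeDriverManager().install())`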


+ `When using an incognito browser:` your browsing tabs will pull different data than a normal window. Understand this when doing your troubleshooting and debugging. If you have a normal window open to find your tags but scrape in an incognito window, the results will not line up.

+ Also, when you are grabbing `job descriptions` for example you will need to time it so the page will read the data after it is loaded. If you immediately try to grab data you may not get everything!
    + Option 1: use the clickable tab from the `job title` then scrape directly
    + Option 2: consider saving the `HREF's` and then doing a separate parsing in a different function. This I think may be faster. But, check for yourself.
    
+ To speed things up consider a `headless browser`, but understand that debugging becomes harder!

+ **If you parse a good amount of pages** you will encounter a checkbox that needs to be clicked to show you are not a robot. This usually happens to me after 15-30 pages of scraping, which is not a lot. (I need to figure this out)
    + Option 1: try to see if you can pull the information for this button to scrape it directly and click
    + Option 2: reset and tinker with the settings of timing out, sleep settings and maybe error handling

**`Big Concern: Pagination`**
Going from page to page sequentially is not straightforward. Practice and a lot of reading will aid you. I am not savvy just yet.
+ Learning how to use clickable buttons and WHEN TO STOP iterating are NOT trivial tasks
+ You can hack your way through, as I did for this example, but there is a glaring issue with duplicate entries.
+ Finding hidden elements and figuring out how to extract them.
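
One possible stopping rule, sketched under my own assumptions: keep requesting pages until a page adds no job IDs you have not already seen. `collect_until_repeat` and `fetch_fake` are hypothetical names, and the fake pages below stand in for real search results; in the scraper, `fetch_page` would load a results page and return `(job_id, title)` pairs.

```python
def collect_until_repeat(fetch_page, max_pages=50):
    """Fetch pages until one adds no new job IDs (or max_pages is hit)."""
    seen = {}
    for page in range(max_pages):
        rows = fetch_page(page)                  # list of (job_id, title) pairs
        new_rows = [(jid, title) for jid, title in rows if jid not in seen]
        if not new_rows:
            break                                # only duplicates, or nothing: stop
        seen.update(new_rows)
    return seen

# Fake search results: page 2 repeats page 1's last job, page 3 adds nothing new.
fake_site = [
    [("job_a", "Data Engineer"), ("job_b", "Analyst")],
    [("job_b", "Analyst"), ("job_c", "ML Engineer")],
    [("job_c", "ML Engineer")],
    [("job_d", "Never reached")],
]

def fetch_fake(page):
    return fake_site[page] if page < len(fake_site) else []

jobs = collect_until_repeat(fetch_fake)
print(list(jobs))   # ['job_a', 'job_b', 'job_c']
```

Because the rows are keyed by job ID, this also takes care of the duplicate entries along the way.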

# `Option 1 Find Description Links From Beginning:`

In [525]:


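# Collect the full description text for each job (this list was commented out in the practice cell)
job_description_list = []

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option)
sleep(randint(2, 6))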


for i in range(0,max_iter_pgs):
    driver.get(paginaton_url.format(job_,location,i*10))
    
    sleep(randint(2, 4))

    job_page = driver.find_element(By.ID,"mosaic-jobResults")
    jobs = job_page.find_elements(By.CLASS_NAME,"job_seen_beacon") # return a list

    for jj in jobs:
        job_title = jj.find_element(By.CLASS_NAME,"jobTitle")

                
        # Click the job element to get the description
        job_title.click()
        
        # Help to load page so we can find and extract data
        sleep(randint(3, 5))

        try: 
            job_description_list.append(driver.find_element(By.ID,"jobDescriptionText").text)
            
        except NoSuchElementException: 
            # No description panel found for this job
            job_description_list.append(None)
driver.quit()
# job_description_list[-17:-1]

['Senior Software Engineer\nJoin an expert team that is breaking records in real-time Big Data performance\nChange the way the world manipulates and analyzes large quantities of data\nAddress our customer’s data pain points and delight them with your solutions\nSpaceCurve is building Big Data analytic solutions focused on spatial, temporal, sensor and graph applications. Targeting mobile, life sciences, oil and gas and government markets. Our unique database technology can power real-time models of reality. We are enabling completely new applications and radical enhancements to existing applications.\nThe role:\nOur product is a database purpose-built to parallelize storage and retrieval of multi-dimensional data on clusters of shared-nothing commodity hardware. It automatically shards and re-balances data across the cluster. We’re looking for a Senior Software Engineer with the vision and hands-on skills to enable early adoption as a key person in the core technical team. You will be 

In [554]:

job_description_list_02=[]
# The second field of each row in job_lst is the href to the full job description
descr_link_lst=[row[1] for row in job_lst]

# `Option 2: call links from list, iterate links directly`

In [560]:
# headless browser
# possible wait function for page to load

# import time

start = time.time()


# Build the options once: incognito (going undercover) and headless
option_= webdriver.ChromeOptions()
option_.add_argument("--incognito")
option_.add_argument("--headless=new")

for link in descr_link_lst:
    # A fresh browser per link is simple but slow
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option_)
    try: 
        driver.get(link)
        # Give the page time to load before reading the description
        sleep(randint(2, 5))
        job_description_list_02.append(driver.find_element(By.ID,"jobDescriptionText").text)
    except NoSuchElementException: 
        job_description_list_02.append(None)
    finally:
        # Always close the browser, even if something above failed
        driver.quit()
    
end = time.time()
print(end - start)

1201.497257232666


In [566]:
# Description from the 2nd to last entry as an illustration
job_description_list_02[-2]

'Senior Software Engineer\nJoin an expert team that is breaking records in real-time Big Data performance\nChange the way the world manipulates and analyzes large quantities of data\nAddress our customer’s data pain points and delight them with your solutions\nSpaceCurve is building Big Data analytic solutions focused on spatial, temporal, sensor and graph applications. Targeting mobile, life sciences, oil and gas and government markets. Our unique database technology can power real-time models of reality. We are enabling completely new applications and radical enhancements to existing applications.\nThe role:\nOur product is a database purpose-built to parallelize storage and retrieval of multi-dimensional data on clusters of shared-nothing commodity hardware. It automatically shards and re-balances data across the cluster. We’re looking for a Senior Software Engineer with the vision and hands-on skills to enable early adoption as a key person in the core technical team. You will be w

In [609]:
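# Trying to do Threading for speed up:

# import threading
# from selenium import webdriver

start = time.time()

class ScrapeThread(threading.Thread):
    # NOTE: no run() method is defined, so starting this thread does no work.
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

# Build the options once: incognito (going undercover) and headless
option_= webdriver.ChromeOptions()
option_.add_argument("--incognito")
option_.add_argument("--headless=new")

threads = []  # despite the name, this collects the scraped descriptions
for url in descr_link_lst:
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option_)
    driver.get(url)
    t = ScrapeThread(url)
    t.start()
    
    # The scraping below still runs one URL at a time in the main thread.
    # The shorter runtime versus Option 2 likely comes from dropping the sleep() call.
    try: 
        threads.append(driver.find_element(By.ID,"jobDescriptionText").text)
 
    except NoSuchElementException: 
        threads.append(None)

    # Close each browser inside the loop so they do not pile up
    driver.quit()


end = time.time()
print(end - start)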

575.802490234375
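
For comparison, here is a sketch of what handing the work to real worker threads could look like with `ThreadPoolExecutor` (already imported at the top). It is not the code that produced the timing above. `fetch_description` is a hypothetical stand-in that only sleeps, the `example.com` URLs are made up, and in a real version each worker would open and quit its own driver, since as far as I know a single WebDriver instance should not be shared between threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_description(url):
    """Stand-in for: open url, wait for the page, read #jobDescriptionText."""
    time.sleep(0.2)                      # pretend we are waiting on the network
    return f"description for {url}"

def scrape_all(urls, worker=fetch_description, max_workers=4):
    """Run worker on each url in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))

urls = [f"https://example.com/job/{n}" for n in range(8)]

start = time.perf_counter()
descriptions = scrape_all(urls)
elapsed = time.perf_counter() - start

print(len(descriptions), "descriptions in", f"{elapsed:.2f} seconds")
```

Eight fake pages with four workers take roughly two rounds of waiting instead of eight. Keep `max_workers` small when scraping for real: more parallel requests means more chances to trigger the robot check.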


# `Why I cannot use Beautiful Soup ANYMORE.. Let's talk`

In [661]:
# for url_link in descr_link_lst:
# job_descr_txt=[]    
# # headers=headers
# url_1='https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date'
# response = requests.get(url_1.format('data+engineer','denver'))
# # ,headers=headers)
# print(response)
# html_ = response.text
# # print(html_)
# soup_ = BeautifulSoup(html_, 'html.parser')
# print(soup_.text)


In [644]:
# Short Version to show illustration:

paginaton_url_ = 'https://www.indeed.com/jobs?q={}&l={}&sort=date&start={}'

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option)
p_=[]
salary_list_=[]
for i in range(0,3):
    driver.get(paginaton_url_.format(job_,location,i*10))
    sleep(randint(2, 3))
    
    job_page = driver.find_element(By.ID,"mosaic-jobResults")
    jobs = job_page.find_elements(By.CLASS_NAME,"job_seen_beacon") # return a list
    
    for jj in jobs:
        job_title = jj.find_element(By.CLASS_NAME,"jobTitle")
        print(job_title.text)
        p_.append(job_title.text)
#         sleep(randint(3, 5))
        try:
            salary_list_.append(jj.find_element(By.CLASS_NAME,"salary-snippet-container").text)
            print(jj.find_element(By.CLASS_NAME,"salary-snippet-container").text)

        except NoSuchElementException: 
            try: 
                salary_list_.append(jj.find_element(By.CLASS_NAME,"estimated-salary").text)
                print(jj.find_element(By.CLASS_NAME,"estimated-salary").text)
            except NoSuchElementException:
                # Keep salary_list_ aligned with p_ when no salary is shown
                salary_list_.append(None)
                print('None')
                
driver.quit()

# //*[@id="challenge-stage"]/div/label/input

Data Engineer- W2 and onsite-(No C2C Candidates)
Estimated $101K - $128K a year
Data Engineer, Workforce Solutions, Ops HR - WW Ops Ppl Prod-Tech
From $123,700 a year
Data and Analytics Engineer - Senior Associate
Estimated $114K - $145K a year
Data Center Structural Engineer
From $93,500 a year
Senior Data Engineer (US Remote)
Estimated $125K - $158K a year
Data Engineer
None
Artica - Senior Data Applied Science Engineer (Seattle, WA) - Direct Hire [Hybrid]
$180,000 - $200,000 a year
Data Engineer
$140,000 - $190,000 a year
Staff Software Engineer - Data Science
$149,240 - $200,200 a year
Staff Software Engineer - Data Integration
$149,240 - $200,200 a year
Principal Software Engineer (Data), Industry Solutions Engineering
$133,600 - $256,800 a year
Software Engineer, Data Platform
Estimated $137K - $174K a year
Software Engineer, Data Platform
None
Senior Backend Engineer - Data
$140,000 - $215,000 a year
Customer Engineer, Data Analytics, Google Cloud
None
Staff Full Stack Engineer 

In [130]:
# df and store data


# duplicate entries remove!
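
A minimal sketch for this placeholder, assuming rows shaped like `job_lst` above where index 2 holds the job ID. The sample rows are made up, and `drop_duplicates` here is a hypothetical plain-Python helper (pandas has its own `DataFrame.drop_duplicates` if you go the DataFrame route).

```python
import csv
import io

def drop_duplicates(rows, key_index=2):
    """Keep the first row seen for each job ID (index 2 in job_lst above)."""
    unique = {}
    for row in rows:
        unique.setdefault(row[key_index], row)
    return list(unique.values())

rows = [
    ["Data Engineer", "https://example.com/a", "job_a", "Acme"],
    ["Data Engineer", "https://example.com/a", "job_a", "Acme"],   # duplicate listing
    ["Analyst", "https://example.com/b", "job_b", "Globex"],
]
unique_rows = drop_duplicates(rows)

# Written to memory here; swap in open("jobs.csv", "w", newline="") to save a file.
buffer = io.StringIO()
out = csv.writer(buffer)
out.writerow(["title", "link", "job_id", "company"])
out.writerows(unique_rows)

print(len(unique_rows))   # 2
print(buffer.getvalue())
```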

In [131]:
# plots

In [657]:
# consider NLP


# class ScrapeThread(threading.Thread):
#     def __init__(self, url):
#         threading.Thread.__init__(self)
#         self.url = url

        
        
# multi-thread or asyncio
# explicit wait with EC: read this and the above explanations

# `Future improvements for this work:`

+ Speed up: use asyncio and also look at threading to explain differences
+ clean up data, put into DF and do some plotting
+ consider more than one job type after speed up
+ look into explicit wait times
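
As a first step toward the clean-up item in the list above, here is a sketch that turns the salary strings collected earlier into numbers. `parse_salary` is a hypothetical helper and the pattern only covers the formats that showed up in the output above (`$101K`, `$123,700`); other formats would need more work.

```python
import re

# A dollar amount such as $101K or $123,700, with an optional K suffix
_AMOUNT = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)(K?)")

def parse_salary(text):
    """Return (low, high) in dollars, or None when no amount is present."""
    if not text:
        return None
    amounts = []
    for number, suffix in _AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        if suffix == "K":
            value *= 1000
        amounts.append(int(value))
    if not amounts:
        return None
    return min(amounts), max(amounts)

print(parse_salary("Estimated $101K - $128K a year"))   # (101000, 128000)
print(parse_salary("From $123,700 a year"))             # (123700, 123700)
print(parse_salary(None))                                # None
```

The pay period ("a year", "an hour") is ignored here, so check it before comparing values.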

# Like, Share & <font color=red>SUB</font>scribe

# `Citations & Help:`

# ◔̯◔

https://pypi.org/project/webdriver-manager/

https://www.blog.datahut.co/post/scrape-indeed-using-selenium-and-beautifulsoup

https://github.com/henrionantony/Dynamic-Web-Scraping-using-Python-and-Selenium/blob/master/indeed.py

https://www.specrom.com/blog/web-scraping-job-postings-on-indeed-using-python/

https://www.scrapingdog.com/blog/scrape-indeed-using-python/ (bs4 as of Feb 13, 2023)

https://selenium-python.readthedocs.io/locating-elements.html#locating-elements

https://stackoverflow.com/questions/50865088/how-to-get-string-dump-of-lxml-element

https://selenium-python.readthedocs.io/navigating.html

https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf (2020 version)

https://www.pycodemates.com/2022/01/Indeed-jobs-scraping-with-python-bs4-selenium-and-pandas.html

https://medium.com/forcodesake/how-to-build-a-scraping-tool-for-indeed-in-8-minutes-data-science-csv-selenium-beautifulsoup-python-95fcca4b9719 (Good Read & Adapted Code)

https://www.tutorialspoint.com/how-to-open-browser-window-in-incognito-private-mode-using-python-selenium-webdriver

https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.keys.html

https://pythonbasics.org/selenium-wait-for-page-to-load/

https://www.seleniumeasy.com/selenium-tutorials/selenium-headless-browser-execution

https://www.browserstack.com/guide/expectedconditions-in-selenium

https://www.testim.io/blog/xpath-vs-css-selector-difference-choose/

https://www.w3.org/TR/REC-DOM-Level-1/introduction.html

https://github.com/diego-florez/Selenium-Web-Scraping/blob/master/indeed.py (Indeed scrape Selenium 2020) error Handling also

https://www.testim.io/blog/selenium-click-button/

https://scrapfly.io/blog/how-to-scrape-indeedcom/

https://goh.physics.ucdavis.edu/datascience/webscraping/webscraping.html

https://levelup.gitconnected.com/efficiently-scraping-multiple-pages-of-data-a-guide-to-handling-pagination-with-selenium-and-3ed93857f596

https://github.com/israel-dryer/Indeed-Job-Scraper/blob/master/indeed-job-scraper-selenium.ipynb

https://www.zenrows.com/blog/headless-browser-python#switch-to-python-selenium-headless-mode

https://python.plainenglish.io/pagination-techniques-to-scrape-data-from-any-website-in-python-779cd32bd514

https://www.selenium.dev/blog/2023/headless-is-going-away/ (2023 article)

https://www.zenrows.com/blog/bypass-cloudflare-python (cloudflare bot blocking 403 error)

`Code Optimizing with Asyncio, multi-threading and multi-processing:`

https://www.geeksforgeeks.org/multithreading-or-multiprocessing-with-python-and-selenium/

https://www.youtube.com/watch?v=-hw3AaxX5B4

https://webnus.net/how-to-speed-up-selenium-automated-tests-in-2022/ (selenium speed up ideas)

https://medium.com/@PhysicistMarianna/scrape-job-postings-data-from-indeed-com-with-python-b4f31340ef5f (bs4 help maybe)

https://github.com/Ram-95/Indeed_Job_Scraper/blob/master/Indeed_Job_Scraper.py (bs4 idea as well)

https://www.youtube.com/watch?v=HOS5Hix--bE

https://stackoverflow.com/questions/75849391/failed-to-fetch-the-job-titles-from-indeed-using-the-requests-module (cloudscraper idea)

https://www.geeksforgeeks.org/multithreading-python-set-1/ (multi-threading ex.)

https://testdriven.io/blog/building-a-concurrent-web-scraper-with-python-and-selenium/ (come back to this! good write up with code...)

https://medium.com/analytics-vidhya/asynchronous-web-scraping-101-fetching-multiple-urls-using-arsenic-ec2c2404ecb4
