# `Selenium Webscraping Indeed Job Postings - July 2023`

# <font color=red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# `Purpose & Outcome:`

# `What is Selenium and how is it used?`

+ Selenium is a browser automation tool that helps with automated testing, task automation and web scraping.
    + Great for clicking buttons
    + selecting from drop-down menus
    + emulating human interactions on a webpage

In [None]:
# Install if you have never used these: uncomment the lines below to install if needed

# !pip install --upgrade pip
# !pip install -U selenium
# !pip install webdriver-manager
# !pip install lxml
# !pip install beautifulsoup4 pandas

In [1]:
# --------- import necessary modules -------

# For webscraping
from bs4 import BeautifulSoup

# Parsing and creating xml data
from lxml import etree as et

# Store data as a csv file written out
from csv import writer

# Time the calls we make to Indeed
import time

# Pause between actions so the scraping looks more human
from time import sleep

# Dataframe stuff
import pandas as pd

# Random integer for more realistic timing for clicks, buttons and searches during scraping
from random import randint

In [2]:
import selenium

# Check version I am running
selenium.__version__

'4.10.0'

In [27]:
# selenium 4:

from selenium import webdriver

from selenium.webdriver.chrome.service import Service as ChromeService

from webdriver_manager.chrome import ChromeDriverManager

In [28]:
# Locate elements on the page, similar to Beautiful Soup's find_all
from selenium.webdriver.common.by import By

# Try to establish wait times for the page to load
from selenium.webdriver.support.ui import WebDriverWait

# Conditions to wait for, e.g. an element being present or a check returning True
from selenium.webdriver.support import expected_conditions as EC

# Used for keyboard input: up/down, left/right, delete, etc.
from selenium.webdriver.common.keys import Keys

# Exception raised when an element cannot be found on the page
from selenium.common.exceptions import NoSuchElementException

# `Consider Headless Browser: speed up & use less resources:`

There are some considerations though:

+ Some browsers create issues
+ debugging can be tricky
+ you may have limited plugin usage or support
+ you cannot watch how the website or application is behaving
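
A minimal sketch of keeping the headless switch in one place, so you can debug with a visible window and flip to headless once the scraper works. `chrome_arguments` is a hypothetical helper (not part of Selenium); it only builds the list of flag strings, and `--headless=new` assumes a recent Chrome release.

```python
def chrome_arguments(headless=False, incognito=True):
    """Return the Chrome command-line flags for the requested mode."""
    args = []
    if incognito:
        args.append("--incognito")
    if headless:
        # Newer headless mode; note the single '=' in the flag
        args.append("--headless=new")
        # Set the window size explicitly so layouts match a normal desktop window
        args.append("--window-size=1920,1080")
    return args


print(chrome_arguments(headless=True))
```

Each string can then be passed to `add_argument(...)` on a `webdriver.ChromeOptions()` object in a loop.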

# `from selenium.webdriver.common.by import By`

Think of this as being similar to using `find_all` in `Beautiful Soup`.
+ It lets you find something within an HTML document; if `find_element` cannot find it, Selenium raises the exception `NoSuchElementException`.
+ **`Be careful when using By`**: if the page is not static, any attribute you search on can change, so a locator that works today can fail in the future.
    + Searching by `Class`, for example, can create issues later.
        + Class names are mostly there for `CSS` styling and tend to change over time.
    + Searching by `ID` may make your code more robust! An `ID` CAN be a unique identifier, which may help you instead.
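
A minimal sketch of trying the most stable locator first and falling back to weaker ones. `find_first` and `FakeDriver` are hypothetical names used only for illustration; the helper relies on `driver.find_elements(by, value)`, which returns an empty list instead of raising when nothing matches.

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair, else None.

    `locators` is ordered from most to least stable, e.g. an ID before a
    class name.
    """
    for by, value in locators:
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    return None


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, page):
        self.page = page  # maps (by, value) -> list of "elements"

    def find_elements(self, by, value):
        return self.page.get((by, value), [])


# "id" and "class name" are the string values behind By.ID and By.CLASS_NAME
driver_stub = FakeDriver({("class name", "job-card"): ["<card 1>", "<card 2>"]})
print(find_first(driver_stub, [("id", "results"), ("class name", "job-card")]))
```

With a real browser the call would take the same shape, using `By.ID` and `By.CLASS_NAME` and locator values read from the page you are inspecting.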

# `NoSuchElementException`

This exception is raised when Selenium cannot locate an element on the page, for example while the page is still loading, so it is worth handling.
+ With `AJAX` calls you may run into it if the application was built using `React`, `Vue` or `Angular`, because elements can be rendered after the initial page load and the checks above need a different approach, such as polling. [article to explain](https://reflect.run/articles/everything-you-need-to-know-about-nosuchelementexception-in-selenium/)
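
Polling can be sketched in plain Python: call a check repeatedly until it returns something truthy or a deadline passes. `poll_until` is a hypothetical helper shown only to illustrate the idea; Selenium's own `WebDriverWait(driver, timeout).until(...)` works along the same lines and raises `TimeoutException` when time runs out.

```python
import time


def poll_until(check, timeout=10.0, interval=0.5):
    """Call `check()` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulate an element that only "appears" on the third look
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(poll_until(element_appears, timeout=2.0, interval=0.01))
```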

`--------------------------------`

# `Other Common Errors:`

+ **`InvalidSelectorException`**

+ **`ElementNotInteractableException`**

+ **`TimeoutException`**

In [25]:
option= webdriver.ChromeOptions()

# Going undercover:
option.add_argument("--incognito")


# # Consider this for speed-ups once you know the application works and how it renders!

# option.add_argument('--headless=new')



# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=option)

# # chromedriver = r'chromedriver.exe'
# driver.get("https://www.indeed.com/q-USA-jobs.html?vjk=823cd7ee3c203ac3")

In [21]:
# Define job and location search keywords
job_search_keyword = ['Data+Scientist', 'Business+Analyst', 'Data+Engineer', 
                      'Python+Developer', 'Full+Stack+Developer', 'Machine+Learning+Engineer']

# Define Locations of Interest
location_search_keyword = ['New+York', 'California', 'Los+Angeles']

# Define base and pagination URLs
base_url = 'https://www.indeed.com'

# Placeholders, in order: position of interest, location, and starting result offset
paginaton_url = "https://www.indeed.com/jobs?q={}&l={}&radius=35&start={}"
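
The `+` signs in the keywords above are URL encoding for spaces. A minimal sketch that lets the standard library do the encoding, so keywords can be written as plain text; `build_search_url` is a hypothetical helper, and its query parameters simply mirror the pagination URL above.

```python
from urllib.parse import urlencode


def build_search_url(job, location, start=0, radius=35):
    """Build an Indeed search URL from plain-text keywords."""
    # urlencode() turns spaces into '+' and escapes other special characters
    query = urlencode({"q": job, "l": location, "radius": radius, "start": start})
    return f"https://www.indeed.com/jobs?{query}"


print(build_search_url("Full Stack Developer", "New York", start=10))
```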

# Things to consider:

+ Wait for page to load before we start running tasks
+ make sure what we are looking for is actually there
    + It can be absent
    + hidden in DOM, iframe or similar
+ timing our calls to remain more like an average user
+ Exception handling
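
A minimal sketch of the timing point: pause for a random interval between requests so calls are not evenly spaced. `polite_pause` is a hypothetical helper, and the bounds are illustrative rather than tuned values.

```python
import random
import time


def polite_pause(low=3.0, high=11.0, sleep=time.sleep):
    """Sleep for a random number of seconds between low and high; return it.

    `sleep` is injectable so the function can be exercised without waiting.
    """
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Demonstrate without actually waiting by swapping in a no-op sleep
waited = polite_pause(1.0, 2.0, sleep=lambda seconds: None)
print(f"would have waited {waited:.2f} seconds")
```

Calling `polite_pause()` with no arguments before each page request gives a pause in the same range as the `randint(3, 11)` used in this notebook, but with fractional seconds.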

# `Here is a side note:`


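+ This line gives me an error because it was written for an older Selenium version:

`driver = webdriver.Chrome(ChromeDriverManager().install())`

+ Selenium 4 expects the driver path to be wrapped in a `Service` object and passed as `service=`, as in the next cell.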


In [26]:
# Initialize Chrome Webdriver using Chrome Driver Manager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                         options=option)

sleep(randint(3, 11))
# driver.implicitly_wait(.72)

# Open the initial URL; get() returns once the page has loaded (AJAX content may still arrive later).
driver.get("https://www.indeed.com/q-USA-jobs.html?vjk=823cd7ee3c203ac3")

In [8]:
# Consider a few options:

# 1.) Try to use incognito -----------(done)
# 2.) Maybe I should use random int for sleep
# 3.) What to do when I have the human click button for pop up
# 4.) verify if the "+" symbols are needed look at formatting ----------(done)
# 5.) check if the formatting to parse is the same or not for div tags, etc
# 6.) do I need to use headless browser?
# 7.) locating elements using the By class, similar to beautiful soup find_all
# 8.) errors with NoSuchElementException
# 9.) try to identify code that doesn't change over time
# 10.) XPath to find buttons to go page by page (forward/backward arrows) with try/except

# Notes for this project:

+ Filling in forms:



In [None]:
url="https://www.indeed.com/q-USA-jobs.html?vjk=823cd7ee3c203ac3"

In [None]:
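# Function to get the DOM from a given URL
def get_dom(url):
    # Load the page in the Selenium-controlled browser
    driver.get(url)
    # Grab the rendered HTML
    page_content = driver.page_source
    # Let Beautiful Soup tidy the markup, then hand it to lxml for XPath queries
    product_soup = BeautifulSoup(page_content, 'html.parser')
    dom = et.HTML(str(product_soup))
    return dom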

In [None]:
for i in get_dom(url):
    print(et.tostring(i))

In [None]:
# functions to extract job link
def get_job_link(job):
    try:
        job_link = job.xpath('./descendant::h2/a/@href')[0]
    except Exception as e:
        job_link = 'Not available'
    return job_link


# functions to extract job title
def get_job_title(job):
    try:
        job_title = job.xpath('./descendant::h2/a/span/text()')[0]
    except Exception as e:
        job_title = 'Not available'
    return job_title


# functions to extract the company name
def get_company_name(job):
    try:
        company_name = job.xpath('./descendant::span[@class="companyName"]/text()')[0]
    except Exception as e:
        company_name = 'Not available'
    return company_name


# functions to extract the company location
def get_company_location(job):
    try:
        company_location = job.xpath('./descendant::div[@class="companyLocation"]/text()')[0]
    except Exception as e:
        company_location = 'Not available'
    return company_location


# functions to extract salary information
def get_salary(job):
    try:
        salary = job.xpath('./descendant::span[@class="estimated-salary"]/span/text()')
    except Exception as e:
        salary = 'Not available'
    if len(salary) == 0:
        try:
            salary = job.xpath('./descendant::div[@class="metadata salary-snippet-container"]/div/text()')[0]
        except Exception as e:
            salary = 'Not available'
    else:
        salary = salary[0]
    return salary


# functions to extract job type
def get_job_type(job):
    try:
        job_type = job.xpath('./descendant::div[@class="metadata"]/div/text()')[0]
    except Exception as e:
        job_type = 'Not available'
    return job_type


# functions to extract job rating
def get_rating(job):
    try:
        rating = job.xpath('./descendant::span[@class="ratingNumber"]/span/text()')[0]
    except Exception as e:
        rating = 'Not available'
    return rating


# functions to extract job description
def get_job_desc(job):
    try:
        job_desc = job.xpath('./descendant::div[@class="job-snippet"]/ul/li/text()')
    except Exception as e:
        job_desc = ['Not available']
    if job_desc:
        job_desc = ",".join(job_desc)
    else:
        job_desc = 'Not available'
    return job_desc
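
The extractors above share one pattern: run an XPath query, take the first hit, and fall back to `'Not available'`. A minimal sketch of factoring that pattern out; `first_or_default` and `join_or_default` are hypothetical helpers, and the sample values are made up.

```python
def first_or_default(results, default="Not available"):
    """Return the first item of an XPath result list, or `default` if empty."""
    return results[0] if results else default


def join_or_default(results, sep=",", default="Not available"):
    """Join every text fragment in an XPath result list, or return `default`."""
    return sep.join(results) if results else default


# lxml's xpath() returns plain lists for text() and @attribute queries,
# so ordinary lists stand in for real results here
print(first_or_default(["/some/job/link"]))
print(first_or_default([]))
print(join_or_default(["Build models", "Write SQL"]))
```

With a real job card the call would look like `first_or_default(job.xpath('./descendant::h2/a/@href'))`.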

In [None]:
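# Practice version of get_job_link: returns the link wrapped in a list
# (running this cell overrides the get_job_link defined earlier)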
def get_job_link(job):
    practice_=[]
    try:
        job_link = job.xpath('./descendant::h2/a/@href')[0]
    except Exception as e:
        job_link = 'Not available'
    practice_.append(job_link)
#     return job_link
    return practice_

# Collect the job cards for every keyword/location combination
all_jobs = []
for job_keyword in job_search_keyword:
    for location_keyword in location_search_keyword:
        for page_no in range(0, 10, 10):  # first results page only
            url = paginaton_url.format(job_keyword, location_keyword, page_no)
            page_dom = get_dom(url)
            jobs = page_dom.xpath('//div[@class="job_seen_beacon"]')
            # extend() keeps all_jobs a flat list of job elements
            all_jobs.extend(jobs)


In [None]:
all_jobs
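
The nested loops in the search cell walk every combination of keyword, location and page offset. A minimal sketch of the same grid with `itertools.product`; `search_grid` is a hypothetical helper and the short lists are sample values. Note that `range(0, 10, 10)` yields only `0`, so it fetches just the first results page, while `range(0, 100, 10)` covers ten pages.

```python
from itertools import product


def search_grid(jobs, locations, max_results=10, page_size=10):
    """Yield (job, location, start_offset) for every page of every search."""
    offsets = range(0, max_results, page_size)
    yield from product(jobs, locations, offsets)


grid = list(search_grid(["Data+Scientist"], ["New+York", "California"], max_results=20))
for job, location, start in grid:
    print(job, location, start)
```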

In [None]:
# Open a CSV file to write the job listings data
# with open('indeed_jobs1.csv', 'w', newline='', encoding='utf-8') as f:
#     theWriter = writer(f)
#     heading = ['job_link', 'job_title', 'company_name', 'company_location', 'salary', 'job_type', 'rating', 'job_description', 'searched_job', 'searched_location']
#     theWriter.writerow(heading)
#     for job_keyword in job_search_keyword:
#         for location_keyword in location_search_keyword:
#             all_jobs = []
#             for page_no in range(0, 100, 10):
#                 url = paginaton_url.format(job_keyword, location_keyword, page_no)
#                 page_dom = get_dom(url)
#                 jobs = page_dom.xpath('//div[@class="job_seen_beacon"]')
#                 all_jobs = all_jobs + jobs  # list.append() returns None, so concatenate instead
#             for job in all_jobs:
#                 job_link = base_url + get_job_link(job)
#                 time.sleep(2)
#                 job_title = get_job_title(job)
#                 time.sleep(2)
#                 company_name = get_company_name(job)
#                 time.sleep(2)
#                 company_location = get_company_location(job)
#                 time.sleep(2)
#                 salary = get_salary(job)
#                 time.sleep(2)
#                 job_type = get_job_type(job)
#                 time.sleep(2)
#                 rating = get_rating(job)
#                 time.sleep(2)
#                 job_desc = get_job_desc(job)
#                 time.sleep(2)
#                 record = [job_link, job_title, company_name, company_location, salary, job_type, rating, job_desc, job_keyword, location_keyword]
#                 theWriter.writerow(record)

# # Closing the web browser
# driver.quit()
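
A minimal sketch of the write-out step using `csv.DictWriter`, which ties each value to its column name so a reordered list cannot shift columns. The record below is made-up sample data and `write_jobs` is a hypothetical helper; it writes to any file-like object, so an in-memory buffer stands in for `indeed_jobs1.csv`.

```python
import csv
import io

FIELDS = ["job_link", "job_title", "company_name", "searched_job", "searched_location"]


def write_jobs(handle, records):
    """Write a header row followed by one row per record dict."""
    job_writer = csv.DictWriter(handle, fieldnames=FIELDS)
    job_writer.writeheader()
    job_writer.writerows(records)


# Made-up record, only to show the shape of the data
sample = [{
    "job_link": "https://www.indeed.com/example",
    "job_title": "Data Scientist",
    "company_name": "Example Co",
    "searched_job": "Data+Scientist",
    "searched_location": "New+York",
}]

buffer = io.StringIO()
write_jobs(buffer, sample)
print(buffer.getvalue())
```

With a real file, open it with `newline=''` and `encoding='utf-8'` as in the commented cell above and pass the file handle instead of the buffer.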

In [None]:
for job_keyword in job_search_keyword:
    for location_keyword in location_search_keyword:
#         print(job_keyword)
        all_jobs = []
        for page_no in range(0, 10, 10): # changed 0,100,10
            url = paginaton_url.format(job_keyword, location_keyword, page_no)
            page_dom = get_dom(url)
            jobs = page_dom.xpath('//div[@class="job_seen_beacon"]')
#             all_jobs_ = all_jobs.append(jobs)
            # xpath() returns a list of elements, so print each card's text
            for job in jobs:
                print(''.join(job.itertext()))
#             all_jobs_ = all_jobs+jobs #changed here and below
#             print("yay",all_jobs_)
#         for job in all_jobs_:
#             job_link = base_url + get_job_link(job)
#             time.sleep(2)
#             job_title = get_job_title(job)
#             time.sleep(2)
#             company_name = get_company_name(job)
#             time.sleep(2)
#             company_location = get_company_location(job)
#             time.sleep(2)
#             salary = get_salary(job)
#             time.sleep(2)
#             job_type = get_job_type(job)
#             time.sleep(2)
#             rating = get_rating(job)
#             time.sleep(2)
#             job_desc = get_job_desc(job)
#             time.sleep(2)
#             print(job_link,company_location,salary,job_type,rating)

In [None]:
import pandas as pd

pd.read_csv('indeed_jobs1.csv')

# Like, Share & <font color=red>SUB</font>scribe

# `Citations & Help:`

# ◔̯◔

https://pypi.org/project/webdriver-manager/

https://www.blog.datahut.co/post/scrape-indeed-using-selenium-and-beautifulsoup

https://github.com/henrionantony/Dynamic-Web-Scraping-using-Python-and-Selenium/blob/master/indeed.py

https://www.specrom.com/blog/web-scraping-job-postings-on-indeed-using-python/

https://www.scrapingdog.com/blog/scrape-indeed-using-python/

https://selenium-python.readthedocs.io/locating-elements.html#locating-elements

https://stackoverflow.com/questions/50865088/how-to-get-string-dump-of-lxml-element

https://selenium-python.readthedocs.io/navigating.html

https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf

https://www.pycodemates.com/2022/01/Indeed-jobs-scraping-with-python-bs4-selenium-and-pandas.html

https://medium.com/forcodesake/how-to-build-a-scraping-tool-for-indeed-in-8-minutes-data-science-csv-selenium-beautifulsoup-python-95fcca4b9719

https://www.tutorialspoint.com/how-to-open-browser-window-in-incognito-private-mode-using-python-selenium-webdriver

https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.keys.html

https://pythonbasics.org/selenium-wait-for-page-to-load/

https://www.seleniumeasy.com/selenium-tutorials/selenium-headless-browser-execution