# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using the tags and class names that identify the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
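
Step 1 can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`, assuming the robots.txt rules are already available as text. `is_allowed` and `sample_rules` are illustrative names, and the rules shown are made up for the example rather than taken from Indeed's actual robots.txt. A site's terms of service may impose restrictions that robots.txt does not capture.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
sample_rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/jobs"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.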

### Import Dependencies 

In [23]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [24]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
# driver_path = "./programs/geckodriver.exe"
# raw string, so the backslashes are not interpreted as escape sequences
driver_path = r'C:\programs\geckodriver.exe'
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}
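
A wrong driver path only surfaces once a browser session is started, so it can help to validate it up front. The sketch below is not part of the original notebook: `find_driver` is a hypothetical helper that returns the first candidate path that exists and otherwise falls back to whatever `geckodriver` executable is found on `PATH`.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_driver(candidates: Iterable[str], name: str = "geckodriver") -> Optional[str]:
    """Return the first candidate that is an existing file, else a match on PATH, else None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    # shutil.which returns None when no executable with this name is on PATH
    return shutil.which(name)
```

It could be used as `driver_path = find_driver([r"C:\programs\geckodriver.exe", "/usr/bin/geckodriver"])`, raising an error early if the result is `None`.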

### Define position and location 

In [25]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "Toronto"

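from urllib.parse import quote_plus

def get_url(position, location):
    """Build the Indeed search URL, URL-encoding spaces and punctuation in both values."""
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    return url_template.format(quote_plus(position), quote_plus(location))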

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
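
The scraping cell pages through results by appending a `start` offset in steps of 10 to the search URL. As a sketch of that idea, the snippet below builds the full list of page URLs with `urllib.parse.urlencode`, which also escapes spaces and commas in the query values. `build_page_urls` is a hypothetical helper, and the assumption that Indeed accepts `q`, `l` and `start` in this form is carried over from the notebook rather than verified.

```python
from urllib.parse import urlencode


def build_page_urls(position, location, postings, page_size=10):
    """Return one search URL per results page, offset by page_size each time."""
    base = "https://www.indeed.com/jobs"
    urls = []
    for start in range(0, postings, page_size):
        # urlencode escapes each value (a space becomes "+")
        query = urlencode({"q": position, "l": location, "start": start})
        urls.append(f"{base}?{query}")
    return urls


urls = build_page_urls("data scientist", "Toronto", postings=30)
for u in urls:
    print(u)
```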

### Scrape job postings

In [26]:
## Number of postings to scrape
postings = 2000

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 10))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # skip cards that have no company name
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # optional fields: select() returns an empty list when the element is missing
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Machine Learning/Data Engineer
Job number    2 added - Data Scientist
Job number    3 added - Data Science Manager
Job number    4 added - Director, Data Science
Job number    5 added - Sr. Data Analyst
Job number    6 added - Senior Fiscal Data Analyst
Job number    7 added - Data Scientist – Qatar – TS/SCI
Job number    8 added - Sr. Data Scientist
Job number    9 added - Data Scientist Senior
Job number   10 added - Data Analyst Senior - Remote
Job number   11 added - Treasury Data Analyst Senior - Remote
Job number   12 added - Principal Statistical Programmer (Remote)
Job number   13 added - Manager, Statistical Programming (Remote)
Job number   14 added - Statistical Programmer II (Remote)
Job number   15 added - Data Scientist
Job number   16 added - Machine Learning/Data Engineer
Job number   17 added - Data Scientist
Job number   18 added - Data Science Manager
Job number   19 added - Director, Data Science
Job number   20 added - Sr. Data Analyst
Job n
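
The scraping cell shares one `state` dictionary between all browser tasks so that a posting seen by one task is skipped by the others. The pattern can be tried without a browser. The sketch below is a simplified stand-in, assuming plain strings in place of scraped job cards; `collect_unique` and the sample batches are illustrative only.

```python
import asyncio


async def collect_unique(worker_id, items, state, out):
    """Append items to out unless another worker has already claimed them."""
    for item in items:
        async with state["lock"]:
            if item in state["ids"]:
                continue  # duplicate: already claimed by some worker
            state["ids"].add(item)
            state["n"] += 1
        out.append((worker_id, item))
        await asyncio.sleep(0)  # yield so the other workers get a turn


async def main():
    state = {"lock": asyncio.Lock(), "ids": set(), "n": 0}
    out = []
    batches = [["a", "b"], ["b", "c"], ["c", "a", "d"]]
    await asyncio.gather(
        *(collect_unique(i, batch, state, out) for i, batch in enumerate(batches))
    )
    return state, out


state, out = asyncio.run(main())
print(state["n"], sorted(item for _, item in out))
```

Inside a notebook, where an event loop is already running, `await main()` replaces `asyncio.run(main())`.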

### Scrape full job descriptions

In [27]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
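
The links collected from the result cards are relative paths such as `/rc/clk?...`, which is why the cell above prefixes them with the site root. As a hedged alternative to string concatenation, `urllib.parse.urljoin` also copes with links that are already absolute. `absolute_link` is a hypothetical helper and the sample links are invented for the example.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.indeed.com"


def absolute_link(href: str, base: str = BASE_URL) -> str:
    """Resolve a scraped href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href)


print(absolute_link("/rc/clk?jk=abc123"))
print(absolute_link("https://other.example/job/1"))
```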

### Save results

In [28]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
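
Because the file name is built directly from the search terms, a position such as "data scientist" or a location such as "Toronto, ON" puts spaces and commas into it. A possible way to normalise the name is sketched below; `build_filename` is a hypothetical helper, and replacing unsafe characters with underscores is one convention among several.

```python
import re
from datetime import datetime


def build_filename(position, location, when=None):
    """Return '<date>_<position>_<location>.csv' with unsafe characters replaced by '_'."""
    when = when or datetime.today()
    parts = [when.strftime("%Y-%m-%d"), position, location]
    # keep word characters (letters, digits, underscore), dots and hyphens;
    # collapse every other run of characters into a single underscore
    safe = [re.sub(r"[^\w.-]+", "_", part).strip("_") for part in parts]
    return "_".join(safe) + ".csv"


print(build_filename("data scientist", "Toronto, ON", datetime(2024, 1, 31)))
```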

In [29]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Principal Statistical Programmer (Remote),PAREXEL,Remote in United States,3.6,PostedPosted 6 days ago,,"Provide leadership, project specific training,...",/pagead/clk?mo=r&ad=-6NYlbfkN0Awiy0szp24tPN-CL...,Be part of our empowered Parexel Statistical P...
1,Data Scientist,ConnectiveRx,"Pittsburgh, PA 15275",2.5,PostedPosted 12 days ago,,"Strong analytics skills, ability to interpret ...",/rc/clk?jk=cc77fa3f85227d70&fccid=f91bc921fa96...,"ConnectiveRx is a leading, technology-enabled ..."
2,Machine Learning/Data Engineer,Rosen,Remote in Ohio,3.7,PostedPosted 30+ days ago,"$60,000 - $124,000 a year",This is a remote hybrid position ROSEN is a le...,/pagead/clk?mo=r&ad=-6NYlbfkN0D4Xqdf8KU0xH7oiY...,This is a remote hybrid position\nROSEN is a l...
3,Data Scientist,ConnectiveRx,"Pittsburgh, PA 15275",2.5,PostedPosted 12 days ago,,"ConnectiveRx is a leading, technology-enabled ...",/rc/clk?jk=cc77fa3f85227d70&fccid=f91bc921fa96...,"ConnectiveRx is a leading, technology-enabled ..."
4,Data Science Manager,Maronda Inc. and Subsidiaries,"Imperial, PA 15126",,PostedPosted 30+ days ago,,Position Overview The Data Science Manager rol...,/rc/clk?jk=8f0ae7db102b2330&fccid=dd616958bd9d...,Position Overview\nThe Data Science Manager ro...
5,"Director, Data Science",DICK'S Sporting Goods,"Remote in Coraopolis, PA 15108",3.4,PostedPosted 30+ days ago,,The Director of Data Science plays a leadershi...,/rc/clk?jk=c843c5dfd64b36fb&fccid=55a2bdb0a91b...,The Director of Data Science plays a leadershi...
6,Sr. Data Analyst,ConnectiveRx,"Pittsburgh, PA 15275",2.5,PostedPosted 25 days ago,,"ConnectiveRx is a leading, technology-enabled ...",/rc/clk?jk=962de45806d34472&fccid=f91bc921fa96...,"ConnectiveRx is a leading, technology-enabled ..."
7,Senior Fiscal Data Analyst,Beaver County PA,"Beaver, PA 15009",,PostedPosted 30+ days ago,,POSITION DESCRIPTION: To provide complex fisca...,/rc/clk?jk=edbe74bb6db88921&fccid=5ebd2c0a66d2...,POSITION DESCRIPTION:\nTo provide complex fisc...
8,Data Scientist – Qatar – TS/SCI,Peraton,United States,3.2,PostedPosted 30+ days ago,"$115,000 a year",Responsibilities: In support of the Operationa...,/pagead/clk?mo=r&ad=-6NYlbfkN0BWrJOJIc9CpN6yMp...,Responsibilities:\nIn support of the Operation...
9,Sr. Data Scientist,Embrace Pet Insurance,United States,3.9,PostedPosted 30+ days ago,,Overview: Company Description Embrace Pet Insu...,/pagead/clk?mo=r&ad=-6NYlbfkN0AftNSAXRSndXnucC...,Overview:\nCompany Description\nEmbrace Pet In...
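
The `Date` column in the table above holds raw text such as `PostedPosted 6 days ago`, which is awkward to sort or filter. A minimal sketch of turning it into a number of days follows, assuming only the formats visible in that table. `days_since_posted` is a hypothetical helper, `30+` is read as the lower bound 30, and any other wording yields `None`.

```python
import re
from typing import Optional


def days_since_posted(text: str) -> Optional[int]:
    """Extract the day count from strings like 'PostedPosted 6 days ago'."""
    match = re.search(r"(\d+)\+?\s+days?\s+ago", text)
    return int(match.group(1)) if match else None


for sample in ["PostedPosted 6 days ago", "PostedPosted 30+ days ago", "NaN"]:
    print(sample, "->", days_since_posted(sample))
```

With pandas, this could be applied as `dataframe['Date'].map(days_since_posted)`.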
