# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML file) by using the website URL
3. Download the website content
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
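Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser` on a hypothetical `robots.txt` body (`SAMPLE_ROBOTS_TXT` and the `my-scraper` user agent are made up for illustration, not Indeed's actual rules), so it runs offline. A real check would also involve reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not Indeed's real file),
# used so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs?q=data"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file before `can_fetch()` is called.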

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.binary_location = 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
driver_path = "C:/Users/shiva/Desktop/UofT/Term3-Fall2022/APS1624/Assignment 3/geckodriver.exe"
# Linux
#driver_path = "./drivers/linux/geckodriver"
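# Selenium 4 takes the driver path through a Service object; executable_path is deprecated.
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(driver_path), options=options)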



### Define position and location 

In [3]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

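from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the search terms so spaces and special characters are safe in the query string
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url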

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
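The scraping loop in this notebook pages through results by appending `&start=` in steps of 10. As a minimal sketch of that pagination, the generator below builds every page URL up front with `urllib.parse.urlencode`. The names `page_urls`, `BASE_URL` and `PAGE_SIZE` are hypothetical, and the step of 10 is taken from the loop rather than from any documented Indeed behavior.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"
PAGE_SIZE = 10  # assumption: `start` advances by 10 per page, as in the notebook's loop


def page_urls(position, location, postings):
    """Yield one search URL per results page until `postings` results are covered."""
    for start in range(0, postings, PAGE_SIZE):
        # urlencode escapes spaces and special characters in the search terms
        query = urlencode({"q": position, "l": location, "start": start})
        yield f"{BASE_URL}?{query}"


urls = list(page_urls("data scientist", "remote", 30))
for u in urls:
    print(u)
```

Each yielded URL could be passed directly to `driver.get()`.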

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 100

jn = 0  # running count of scraped postings
for i in range(0, postings, 10):
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)

    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        
        jn += 1
        
        # The first anchor inside the job card links to the full posting
        anchors = job.find_elements(By.TAG_NAME, "a")
        links = anchors[0].get_attribute("href")
        
        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()
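        # Optional fields: select() returns an empty list when the element is
        # missing, so indexing raises IndexError and a placeholder is stored.
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''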
       
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Job number    1 added - Data Scientist
Job number    2 added - Data Scientist, Marketing & Online (Remote)
Job number    3 added - Data Scientist - NLP
Job number    4 added - Data Scientist - Telecommute
Job number    5 added - Jr. Data Scientist
Job number    6 added - Jr. Data Scientist
Job number    7 added - Data Scientist
Job number    8 added - Data Analyst
Job number    9 added - Customer Data Scientist
Job number   10 added - Junior Machine Learning Engineer
Job number   11 added - Senior Data Scientist
Job number   12 added - Principal Operations Research Scientist - Network Planning/Future Transportation (Optimization, Machine Learning) Remote or HQ
Job number   13 added - Nurse Data Miner
Job number   14 added - Data Scientist
Job number   15 added - Data Scientist
Job number   16 added - Nurse Data Miner
Job number   17 added - Data Scientist
Job number   18 added - Data Scientist (Remote)
Job number   19 added - Data Scientist Co-Op (Spring 2023)
Job number   20 added - D

In [5]:
driver.quit()

### Scrape full job descriptions

In [6]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [7]:
import random
import time

In [8]:
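# Selenium 4 takes the driver path through a Service object (imported here so this cell runs on its own).
from selenium.webdriver.firefox.service import Service

descriptions = []
driver = webdriver.Firefox(service=Service(driver_path), options=options)
for link in Links_list:
    driver.get(link)
    driver.implicitly_wait(random.randint(3, 8))
    jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)
    # Pause between requests to avoid overloading the site
    time.sleep(random.randint(5, 10))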

dataframe['Descriptions'] = descriptions



In [9]:
driver.quit()
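The description loop assumes every posting page contains the `jobDescriptionText` element. If one page fails, the loop stops and the list no longer matches the number of links. The sketch below shows one possible way to keep one entry per link. `collect_descriptions`, `fetch`, `pause` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in so the example runs without a browser.

```python
from typing import Callable, Iterable, List, Optional


def collect_descriptions(
    links: Iterable[str],
    fetch: Callable[[str], str],
    pause: Callable[[], None] = lambda: None,
) -> List[Optional[str]]:
    """Fetch one description per link, storing None when a fetch fails."""
    descriptions: List[Optional[str]] = []
    for link in links:
        try:
            descriptions.append(fetch(link))
        except Exception as exc:  # broad on purpose: any failure becomes a gap
            print(f"Skipping {link}: {exc}")
            descriptions.append(None)
        pause()  # delay between requests, injected so it can be skipped in tests
    return descriptions


def fake_fetch(link: str) -> str:
    """Stand-in for a browser fetch; fails for links containing 'broken'."""
    if "broken" in link:
        raise LookupError("description element not found")
    return f"description for {link}"


result = collect_descriptions(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/c"],
    fake_fetch,
)
print(result)
```

With Selenium, `fetch` could wrap the `driver.get()` and `find_element()` calls, and `pause` could be `lambda: time.sleep(random.randint(5, 10))`. Because the result always has one entry per link, it can be assigned to a dataframe column without a length mismatch.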

### Save results

In [10]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [11]:
dataframe

| | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Shaw Industries Group, Inc. | Remote | 3.8 | PostedPosted 3 days ago | | Partner with data scientists across the enterp... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | We are looking for a data scientist to join ou... |
| 1 | Data Scientist, Marketing & Online (Remote) | The Home Depot | Remote in Atlanta, GA 30361 | 3.7 | PostedPosted 11 days ago | $90,000 - $160,000 a year | 55% Solution Development - Design and develop ... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Position Purpose:\nThe Data Scientist is respo... |
| 2 | Data Scientist - NLP | Ursus, Inc. | Remote in Menlo Park, CA 94025 | 4.9 | PostedPosted 2 days ago | $40.00 - $48.65 an hour | Apply knowledge of statistics, machine learnin... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | JOB TITLE: Data Scientist - NLP\nLOCATION: Rem... |
| 3 | Data Scientist - Telecommute | UnitedHealth Group | Remote in Eden Prairie, MN 55344 | 3.6 | PostedPosted 6 days ago | | Work alongside other data scientists, engineer... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Combine two of the fastest-growing fields on t... |
| 4 | Jr. Data Scientist | AffixITPro | Remote | | PostedToday | $30 - $40 an hour | The Data Scientist supports the development of... | https://www.indeed.com/company/AffixITPro/jobs... | Job Description\nAffixIT Pro has been awarded ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | Data Scientist | PocketPills | Remote in San Francisco, CA 94103 | | PostedPosted 30+ days ago | | Proven ability to tailor data-driven insights ... | https://www.indeed.com/rc/clk?jk=ae6e092dd67f3... | Company Description\nPocketPills is a tech-dri... |
| 146 | Software Engineer III, AI/ML Threat Prediction... | CrowdStrike, Inc. | Remote in Hempstead, NY 11551 | 3.3 | PostedPosted 5 days ago | $120,000 - $190,000 a year | Understanding data structures and commands for... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | About the Role:\nCrowdstrike’s Proactive Secur... |
| 147 | Applied Scientist - Text-To-Speech (TTS) | Veritone | Remote in New York, NY | 3.1 | PostedPosted 1 day ago | $140,000 a year | Mental health awareness and support.\nThis pos... | https://www.indeed.com/rc/clk?jk=4514adbc3a77c... | WE ARE VERITONE\nWe are driven by the belief t... |
| 148 | Data Scientist II | Tek Ninjas | Remote in Seattle, WA 98101 | | PostedPosted 9 days ago | | Ability to process large sets of data, and cre... | https://www.indeed.com/rc/clk?jk=e8ccf157fe743... | Data Scientist II \| One-year contract\nHelp us... |
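The `Date` column in the output holds raw strings such as `PostedPosted 3 days ago` and `PostedToday`. As a possible clean-up step, the sketch below converts those strings into a number of days. `days_since_posted` is a hypothetical helper, and it only handles the patterns visible in the output above; any other wording returns `None`.

```python
import re
from typing import Optional


def days_since_posted(raw: str) -> Optional[int]:
    """Convert a scraped date string into days since posting, or None if unknown."""
    # Drop the repeated "Posted" prefix seen in the scraped values
    text = re.sub(r"^(?:Posted)+", "", raw).strip()
    if text.lower() == "today":
        return 0
    # Matches "3 days ago", "1 day ago" and "30+ days ago" (treated as 30)
    match = re.search(r"(\d+)\+?\s+days?\s+ago", text)
    return int(match.group(1)) if match else None


for sample in ["PostedPosted 3 days ago", "PostedToday", "PostedPosted 30+ days ago", "NaN"]:
    print(sample, "->", days_since_posted(sample))
```

Applied to the dataframe, this could look like `dataframe['Date'].map(days_since_posted)`.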
