# Extracting jobs from Indeed.com

- [Useful prior information](#info)
- [Importing libraries](#imports)
- [Defining functions](#functions)
- [Running program](#exec)
- [Saving data](#save)
- [Next Steps](#steps)

<a id=info></a>

## Useful Prior Information
- As a rule, prefer an official API over a web scraper, and [Indeed has an open source API.](https://opensource.indeedeng.io/api-documentation/) That said, I am not using this for any commercial purpose and I wanted to learn Selenium, so I used it instead of the API
- [Selenium documentation](https://selenium-python.readthedocs.io/)
- [Download Chromedriver](https://sites.google.com/a/chromium.org/chromedriver/downloads) and put it in the same folder or on a PATH you can easily locate
- [Recommendations on webscraping](https://hackernoon.com/how-to-scrape-a-website-without-getting-blacklisted-271a605a0d94)
- Use a [VPN](https://stackoverflow.com/a/59512170) or [proxies.](https://www.quora.com/I-was-scraping-a-website-and-they-blocked-me-How-can-I-get-around-this) You might want to use [ProxyCrawl](https://proxycrawl.com/scraping-api-avoid-captchas-blocks) if you do not have a VPN
- Read up on [what robots.txt is](https://www.cloudflare.com/learning/bots/what-is-robots.txt/) and check [Indeed's robots.txt](https://www.indeed.com/robots.txt)
- [The legality of webscraping](https://parsers.me/us-court-fully-legalized-website-scraping-and-technically-prohibited-it/)
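
The robots.txt check mentioned in the list above can be automated with Python's standard library. The sketch below is not part of the original notebook: the sample rules, the `my-bot` user agent and the `example.com` URLs are made up so that the example runs offline, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT Indeed's), so the example runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/jobs"))       # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/x"))  # False
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called. Keep in mind that robots.txt is separate from the Terms of Service, so passing this check does not settle whether scraping is permitted.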

Even with the [LinkedIn vs HiQ](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) case setting a precedent, be mindful of the ToS of every website. The following message can be found on [Indeed's Terms and Conditions](https://www.indeed.com/legal?hl=en&redirect=true) for all users:

<div class="alert alert-danger" role="alert">

**2. Using our Site**

Use of any automated system or software, whether operated by a third party or otherwise, to extract data from the Site (such as screen scraping or crawling) is prohibited. Indeed reserves the right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorized use of the Site. If you wish to make commercial use of the Site, if you wish to make use of the Site in any capacity other than that of a Jobseeker or Employer, or if you wish to purchase Indeed services that utilize the Site, you must have a prior written agreement with Indeed to do so, or have accepted Indeed’s online terms of service. Please contact us for more information. We reserve the right at all times (but will not have any obligation) to terminate users, and reclaim usernames or URLs, for any reason.
    
</div>

<a id=imports></a>

## Imports

In [2]:
#import selenium
import pandas as pd
import os
import time 

from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

<a id=functions></a>

## Defining Functions

In [3]:
def get_info_indeed(info_list):
    
    column = []
    information = []

    if 'title' in info_list:
        column.append('title')
        # empty <h2> elements are skipped
        titles = [i.find_element_by_tag_name('a').get_attribute('title')
                  for i in driver.find_elements_by_tag_name('h2') if i.text != '']
        information.append(titles)

    if 'company' in info_list:
        column.append('company')
        companies = [i.text for i in driver.find_elements_by_class_name('company') if i.text != '']
        information.append(companies)

    if 'link' in info_list:
        column.append('link')
        links = [i.find_element_by_tag_name('a').get_attribute('href')
                 for i in driver.find_elements_by_class_name('title') if i.text != '']
        information.append(links)

    if 'date_listed' in info_list:
        column.append('date_listed')
        dates = [i.text for i in driver.find_elements_by_class_name('date') if i.text != '']
        information.append(dates)

    if 'location' in info_list:
        column.append('location')
        # explanation on why we need the full-stops: https://stackoverflow.com/questions/58422998/selenium-python-find-elements-by-class-name-returns-nothing
        location = [i.text for i in driver.find_elements_by_class_name('location.accessible-contrast-color-location') if i.text != '']
        information.append(location)
        
    if 'salary' in info_list:
        column.append('salary')
        salary = []
        for i in driver.find_elements_by_class_name('jobsearch-SerpJobCard.unifiedRow.row.result.clickcard'):
            try:
                salary.append(i.find_elements_by_class_name('salaryText')[0].text)
            except (IndexError, WebDriverException):  # this job card shows no salary
                salary.append('None')
        information.append(salary)
                
    if 'remote' in info_list:
        column.append('remote')
        remote = []
        for i in driver.find_elements_by_class_name('sjcl'):
            try:
                remote.append(i.find_elements_by_class_name('remote')[0].text)
            except (IndexError, WebDriverException):  # this job card has no remote label
                remote.append('None')
        information.append(remote)
        
    if 'rating' in info_list:
        column.append('rating')
        rating = []
        for i in driver.find_elements_by_class_name('sjcl'):
            try:
                rating.append(i.find_elements_by_class_name('ratingsContent')[0].text)
            except (IndexError, WebDriverException):  # this job card has no company rating
                rating.append('None')
        information.append(rating)
        
    if 'easy_apply' in info_list:
        column.append('easy_apply')
        easy = []
        for i in driver.find_elements_by_class_name('jobsearch-SerpJobCard.unifiedRow.row.result.clickcard'):
            try:
                easy.append(i.find_elements_by_class_name('iaLabel.iaIconActive')[0].text)
            except (IndexError, WebDriverException):  # this job card has no "easily apply" label
                easy.append('None')
        information.append(easy)        
        
        
    # pair each column name with its list of scraped values
    jobs = dict(zip(column, information))

    return jobs

def get_descriptions_indeed():
    
    descriptions = []
    
    ids = [i.find_element_by_tag_name('a').get_attribute('id') for i in driver.find_elements_by_tag_name('h2') if i.text != '']
    
    for job in ids:
        driver.find_elements_by_id(job)[0].click()

        time.sleep(4) #https://hackernoon.com/how-to-scrape-a-website-without-getting-blacklisted-271a605a0d94

        driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
        descriptions.append(driver.find_element_by_id('jobDescriptionText').text)

        driver.switch_to.parent_frame()
        time.sleep(4)
    
    return descriptions
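
One thing to watch with `get_info_indeed` is that every column is collected by its own query, so the lists can end up with different lengths (for example when one card lacks a company element). `pd.DataFrame` raises a `ValueError` when given a dict of unequal-length lists. A minimal sketch of a check that could run before building the DataFrame follows; `find_length_mismatches` is a hypothetical helper, not part of the notebook, and the sample values are made up.

```python
from collections import Counter


def find_length_mismatches(columns):
    """Return {column name: length} for columns that differ from the most common length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    # the length shared by most columns is treated as the expected one
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}


# made-up sample shaped like the output of get_info_indeed
jobs = {
    "title": ["Machine Learning Engineer", "Data Scientist"],
    "company": ["Apple", "Sandia National Laboratories"],
    "salary": ["$85 - $90 an hour"],
}
print(find_length_mismatches(jobs))  # {'salary': 1}
```

An empty result means the dict is safe to pass to `pd.DataFrame`. A non-empty one points at the selector that most likely missed or over-matched elements on the current page.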

<a id=exec></a>

## Running Program

In [None]:
#specify what jobs you want to search for
keywords = 'machine learning engineer'
location = 'California'

#specify the driver path; in my case it is in the same folder as this notebook
DRIVER_PATH = 'chromedriver'
driver = webdriver.Chrome(executable_path = DRIVER_PATH)

#determine the website (in our case indeed.com)
driver.get('https://indeed.com')
time.sleep(2)

#type the keywords specified above into the "what" box
search_job = driver.find_elements_by_id('text-input-what')[0]
search_job.send_keys(keywords)

#clear the "where" box and type the location; I use COMMAND because I am on a Mac (on Windows change it to CONTROL)
search_location = driver.find_elements_by_id('text-input-where')[0]
search_location.send_keys(Keys.COMMAND,"a",Keys.BACKSPACE)
search_location.send_keys(location)
time.sleep(2)

#this clicks the "find jobs" button
initial_search_button = driver.find_element_by_class_name('icl-WhatWhere-buttonWrapper')
initial_search_button.click()

#this will tell you how many jobs there currently are for your search
print(driver.find_element_by_id('searchCountPages').text)
page_counter = 1

#specify what information you want to extract
info_list = ['title', 'company', 'link', 'date_listed', 'location', 'salary', 'remote', 'rating', 'easy_apply']
df = pd.DataFrame(get_info_indeed(info_list))

#this calls the description function to get detailed information
df['description'] = get_descriptions_indeed()

#clicks on "next page"
driver.find_element_by_xpath("//a[@aria-label='Next']").click()
print("Navigating to Next Page")
page_counter += 1
print("You are currently on page", page_counter)
time.sleep(3)

#on the second page a pop-up normally appears asking you to set up an email alert; this closes it if it is there
popups = driver.find_elements_by_id('popover-x')  # empty list instead of an exception when there is no pop-up
if popups and popups[0].is_displayed():
    popups[0].click()
    time.sleep(3)

#extract data from the second page
new_df = pd.DataFrame(get_info_indeed(info_list))
new_df['description'] = get_descriptions_indeed()
df = pd.concat([df, new_df], ignore_index=True)

# for the next pages we can create a while loop    
while True:

    try:   
        driver.find_element_by_xpath("//a[@aria-label='Next']").click()
        print("Navigating to Next Page")
        page_counter += 1
        print("You are currently on page", page_counter)
        time.sleep(2)
        
        new_df = pd.DataFrame(get_info_indeed(info_list))
        new_df['description'] = get_descriptions_indeed()
        df = pd.concat([df, new_df], ignore_index=True)

    except (TimeoutException, WebDriverException) as e:
        print("Last page reached or you were kicked out")
        break

driver.close()
driver.quit()   
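
The cells above pause with fixed `time.sleep` values. A common suggestion for scrapers is to randomize those pauses so that requests are not perfectly regular. The sketch below shows one way to do it; `polite_sleep` and its default bounds are assumptions for illustration, not something the notebook uses.

```python
import random
import time


def polite_sleep(low=2.0, high=5.0, sleep=time.sleep):
    """Pause for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)  # the sleep function is injectable so tests do not have to wait
    return delay


# short bounds here only to keep the demo fast
waited = polite_sleep(0.01, 0.02)
print(f"waited {waited:.3f} seconds")
```

In the notebook, a call such as `polite_sleep(3, 6)` could stand in for `time.sleep(4)` inside `get_descriptions_indeed`.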

In [5]:
len(df)

30

In [6]:
df.tail(3)

|    | title | company | link | date_listed | location | salary | remote | rating | easy_apply | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 27 | Intern - Interpretable Machine Learning RD Und... | Sandia National Laboratories | https://www.indeed.com/rc/clk?jk=bcd9484e9c9e9... | 10 days ago | Livermore, CA 94551 |  |  | 4.2 |  | What Your Job Will Be Like\n\nWe are seeking a... |
| 28 | Machine Learning Engineer | Apple | https://www.indeed.com/rc/clk?jk=20a00c5efac63... | 8 days ago | Santa Clara Valley, CA 95014 |  |  | 4.2 |  | Summary\nPosted: Oct 21, 2020\nWeekly Hours: 4... |
| 29 | Machine Learning Engineer | ADAPT Technology LLC. | https://www.indeed.com/company/Adapt-Technolog... | 9 days ago | Mountain View, CA 94043 | $85 - $90 an hour | Temporarily remote |  | Easily apply | Machine Learning Engineer - Intelligent Mobili... |
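
The `date_listed` column holds relative strings such as `10 days ago`, which lose their meaning once the CSV is a few days old. A minimal sketch for turning them into absolute dates is shown below. `parse_date_listed` is a hypothetical helper, and the labels other than `N days ago` (`today`, `just posted`, `30+ days ago`) are assumptions about what the site may display.

```python
import re
from datetime import date, timedelta


def parse_date_listed(text, today=None):
    """Convert strings like '10 days ago' to a date; return None if unrecognized."""
    today = today or date.today()
    cleaned = text.strip().lower()
    if cleaned in ("today", "just posted"):  # assumed labels
        return today
    # '30+ days ago' (assumed label) is treated as exactly 30 days, a lower bound
    match = re.match(r"(\d+)\+?\s+days?\s+ago", cleaned)
    if match:
        return today - timedelta(days=int(match.group(1)))
    return None


scraped_on = date(2020, 10, 30)  # the day this notebook was run
print(parse_date_listed("10 days ago", scraped_on))  # 2020-10-20
print(parse_date_listed("8 days ago", scraped_on))   # 2020-10-22
```

With pandas this could be applied as `df['date_listed'].apply(parse_date_listed)`, ideally right after scraping so that `today` matches the scrape date.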


<a id=save></a>

## Saving Data

In [7]:
from datetime import date

today = date.today().strftime("%d_%m_%Y")
today

'30_10_2020'

In [8]:
os.makedirs('data', exist_ok=True)  # to_csv does not create a missing folder
df.to_csv('data/'+today+'_'+keywords+'_'+location+'.csv', index=False)
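
The file name above is built by string concatenation, so spaces in `keywords` or `location` end up in the name. One possible tidy-up is sketched below; `build_csv_path` is a hypothetical helper, and replacing runs of non-word characters with a hyphen is just one convention.

```python
import re
from pathlib import Path


def build_csv_path(folder, day, keywords, location):
    """Return folder/day_keywords_location.csv with non-word characters replaced by '-'."""
    parts = [day, keywords, location]
    # \W+ matches runs of anything other than letters, digits and underscores
    slug = "_".join(re.sub(r"\W+", "-", part.strip()).strip("-") for part in parts)
    return Path(folder) / f"{slug}.csv"


path = build_csv_path("data", "30_10_2020", "machine learning engineer", "California")
print(path.as_posix())  # data/30_10_2020_machine-learning-engineer_California.csv
```

The resulting `Path` can be passed straight to `df.to_csv(path, index=False)`.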

<a id=steps></a>

## Next Steps

Here is a set of improvements someone with some time could work on:
- Implement Advanced Search
- Automatically apply to companies that accept "easy apply"
- Implement stopping points (sometimes you do not want all the pages)
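
The stopping-point idea from the list above could be sketched as a small paging loop with an optional page limit. Everything here is hypothetical: `scrape_pages` is not part of the notebook, and the two callables stand in for "scrape the current page" and "click Next", replaced by fakes so the example runs without a browser.

```python
def scrape_pages(scrape_page, go_to_next_page, max_pages=None):
    """Collect one result per page until max_pages is reached or paging fails."""
    results = []
    page = 1
    while True:
        results.append(scrape_page())
        if max_pages is not None and page >= max_pages:
            break  # stopping point reached
        if not go_to_next_page():
            break  # no further page available
        page += 1
    return results


# fake five-page site used in place of Selenium calls
state = {"page": 1}


def fake_scrape():
    return f"page {state['page']}"


def fake_next():
    if state["page"] >= 5:
        return False
    state["page"] += 1
    return True


print(scrape_pages(fake_scrape, fake_next, max_pages=3))  # ['page 1', 'page 2', 'page 3']
```

In the notebook, `scrape_page` could build the per-page DataFrame, and `go_to_next_page` could wrap the "Next" click in a `try`/`except` that returns `False` on failure.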