# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Build the URL of the page to scrape
3. Download the page content (HTML source)
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize raw data in a structured format (e.g., CSV)
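
Step 1 can be partly automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt and asks whether two paths may be fetched. The rules, the `my-scraper` user agent, and the `is_allowed` helper are assumptions for illustration and do not reflect Indeed's actual policy. A site's terms of service still have to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs?q=data"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

To check a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing an inline string.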

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

### Path to webdriver (Firefox, Chrome) 

In [2]:
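# Ensure that the driver path is correct before running this script.
# macOS
driver_path = "/Users/dhairyaparmar/geckodriver"
# Linux
#driver_path = "./drivers/linux/geckodriver"
# Selenium 4 deprecates executable_path; pass the driver path through a Service object.
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(executable_path=driver_path))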


### Define position and location 

In [3]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

def get_url(position, location):
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(position, location)
    return url

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
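
`get_url` inserts the position and location into the query string as-is, and the scraping loop later appends `&start=` offsets in steps of 10. A minimal sketch of building the same page URLs with `urllib.parse.urlencode`, which also escapes spaces and special characters, is shown here. `build_page_urls`, `BASE_URL`, and `page_size` are hypothetical names, and the page size of 10 is taken from the loop's step rather than from any documented Indeed behaviour.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_page_urls(position, location, postings, page_size=10):
    """Return one search URL per results page, URL-encoding the query values."""
    urls = []
    for start in range(0, postings, page_size):
        query = urlencode({"q": position, "l": location, "start": start})
        urls.append(f"{BASE_URL}?{query}")
    return urls

for page_url in build_page_urls("data scientist", "remote", postings=30):
    print(page_url)
```

With `postings=30` this yields three URLs, with `start=0`, `start=10`, and `start=20`.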

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 100

jn=0
for i in range(0, postings, 10):
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)

    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        
        jn += 1
        
        liens = job.find_elements(By.TAG_NAME, "a")
        links = liens[0].get_attribute("href")
        
        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()
        # Optional fields: select() returns an empty list when the element
        # is missing, so indexing raises IndexError.
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''
       
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Job number    1 added - Data Scientist
Job number    2 added - Data Scientist (All Levels)
Job number    3 added - Data Scientist - RWD
Job number    4 added - Jr. Data Scientist
Job number    5 added - Machine Learning Engineer
Job number    6 added - Data Scientist
Job number    7 added - Data Scientist (US Remote Eligible)
Job number    8 added - Data Scientist
Job number    9 added - Jr. Data Scientist
Job number   10 added - Associate Data Scientist
Job number   11 added - Computational Biologist / Data Scientist
Job number   12 added - Interdisciplinary-Microbiologist/Data Scientist
Job number   13 added - Junior Data Scientist
Job number   14 added - Data Scientist I, Product Analytics
Job number   15 added - Senior Data Scientist
Job number   16 added - Data Scientist
Job number   17 added - Data Scientist jobs
Job number   18 added - Data Scientist
Job number   19 added - Data Scientist / NLP (Python, Django/Flask, NLP, Clustering, Rest API)
Job number   20 added - Senior Data

In [5]:
driver.quit()

### Scrape full job descriptions

In [6]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [7]:
import random
import time

In [8]:
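# Selenium 4 deprecates executable_path; pass the driver path through a Service object.
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(executable_path=driver_path))
descriptions = []
for link in Links_list:
    driver.get(link)
    # Wait up to a few seconds for the description element to appear.
    driver.implicitly_wait(random.randint(3, 8))
    jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)
    # Pause between requests to avoid overloading the site.
    time.sleep(random.randint(5, 10))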

dataframe['Descriptions'] = descriptions


In [9]:
driver.quit()
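
The description loop in cell 8 stops at the first link whose page lacks the `jobDescriptionText` element, and the `dataframe['Descriptions']` assignment needs exactly one entry per link. One way to keep the two aligned is sketched below. `collect_descriptions` and `fetch` are hypothetical names: `fetch` stands for any callable that returns the description text for a link (for example a wrapper around the `driver.get` and `find_element` calls). A stub replaces it here so the example runs without a browser.

```python
import random
import time

def collect_descriptions(links, fetch, min_delay=5.0, max_delay=10.0):
    """Call fetch(link) for every link, pausing between requests.

    A failed fetch is recorded as None so the result has one entry per link.
    """
    descriptions = []
    for index, link in enumerate(links):
        try:
            descriptions.append(fetch(link))
        except Exception as error:  # e.g. Selenium's NoSuchElementException
            print(f"Skipping {link}: {error}")
            descriptions.append(None)
        # Sleep between requests, but not after the last one.
        if index < len(links) - 1:
            time.sleep(random.uniform(min_delay, max_delay))
    return descriptions

# Stub standing in for the Selenium lookup, for illustration only.
def fake_fetch(link):
    if "broken" in link:
        raise ValueError("description element not found")
    return f"Description for {link}"

result = collect_descriptions(["job-1", "broken-job", "job-3"], fake_fetch,
                              min_delay=0.0, max_delay=0.0)
print(result)
```

Recording `None` for a failed page keeps the list the same length as `Links_list`, so the column assignment still works and the missing rows can be retried or dropped later.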

### Save results

In [10]:
# Save the dataframe to a CSV file named after the date, position, and location.
# Uncomment the last line to write the file.
date = datetime.today().strftime('%Y-%m-%d')
# dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [11]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Scientist,Data Products LLC,Remote,,PostedToday,"$80,000 - $120,000 a year",Present information using data visualization t...,https://www.indeed.com/company/Data-Products-L...,About us\nWe are professional and data-driven....
1,Data Scientist (All Levels),Noblis,"Remote in Reston, VA 20191",4.0,PostedToday,,"In your role, you will work on multiple projec...",https://www.indeed.com/rc/clk?jk=b8f1b308b3018...,Responsibilities:\nNoblis is seeking to hire D...
2,Data Scientist - RWD,Norstella,Remote,,EmployerActive 5 days ago,"$125,000 - $175,000 a year",Design data pipelines and queries and analyze ...,https://www.indeed.com/company/NorStella/jobs/...,Job Summary:\nWe are seeking an experienced Da...
3,Jr. Data Scientist,Net2Aspire,Remote,,EmployerActive 4 days ago,"$65,000 - $80,000 a year", Create data dashboards and other data visual...,https://www.indeed.com/company/net2aspire/jobs..., Apply Statistical and Machine Learning metho...
4,Machine Learning Engineer,Idea Evolver,Remote,,EmployerActive 4 days ago,"$120,000 - $160,000 a year","Work with product developers (tech regulatory,...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Company Overview\nIdea Evolver specializes in ...
...,...,...,...,...,...,...,...,...,...
145,Data Engineer/Data Scientist,etrailer.com,Remote,3.2,PostedPosted 30+ days ago,"$100,000 - $180,000 a year","Experienced in designing, implementing, and ma...",https://www.indeed.com/rc/clk?jk=1292d743fbc2f...,Mid-to-Senior Level Data Engineer/Data Scienti...
146,Data Scientist,ECS Federal,Remote,,EmployerActive 5 days ago,"$140,000 - $160,000 a year",5 years of analytics experience with designing...,https://www.indeed.com/company/ERPi/jobs/Data-...,ECS is seeking a Data Scientist to work fully ...
147,Data Scientist,Mashvisor Inc.,+1 locationRemote,3.0,PostedPosted 5 days ago,,"Develop machine learning models, write product...",https://www.indeed.com/rc/clk?jk=007e70ed4a993...,Build the solution that transforms the real es...
148,Artificial Intelligence Developer,Ultrafly Solutions private limited,Remote,,PostedPosted 1 day ago,,Collaborate with data scientists and other sta...,https://www.indeed.com/company/Ultrafly-Soluti...,We are seeking an experienced Artificial Intel...
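
The `Date` and `Salary` columns above still hold raw display text, such as `PostedPosted 30+ days ago` or `$80,000 - $120,000 a year`. A possible post-processing step is sketched below with the standard library's `re` module. `clean_date` and `parse_salary` are hypothetical helpers, and the patterns only cover the formats visible in the rows above; other formats on the site would need their own handling.

```python
import re

def clean_date(raw):
    """Strip the 'Posted' / 'Employer' label text fused onto the posting age."""
    return re.sub(r"^(?:Posted|Employer)+\s*", "", raw).strip()

def parse_salary(raw):
    """Return (low, high, period) from text like '$80,000 - $120,000 a year'.

    Returns None when the text contains no dollar amount.
    """
    amounts = [float(a.replace(",", ""))
               for a in re.findall(r"\$([\d,]+(?:\.\d+)?)", raw)]
    if not amounts:
        return None
    period_match = re.search(r"\b(?:a|an|per)\s+(hour|day|week|month|year)\b", raw)
    period = period_match.group(1) if period_match else None
    return min(amounts), max(amounts), period

print(clean_date("PostedPosted 30+ days ago"))
print(clean_date("EmployerActive 5 days ago"))
print(parse_salary("$80,000 - $120,000 a year"))
print(parse_salary("NaN"))
```

Both helpers could be applied column-wise with `dataframe['Date'].map(clean_date)` and `dataframe['Salary'].map(parse_salary)`, assuming every value in those columns is a string.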
