# Indeed Jobs Webscraping

Indeed.com lists jobs that match a search keyword. Enter the keyword in the search bar, or encode it directly in the URL. For example, https://www.indeed.com/jobs?q=data+scientist&l=united+states returns data scientist positions across the United States.
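
As a minimal sketch of how such a URL can be built from a keyword, the snippet below uses only the standard library. The helper name `build_search_url` is hypothetical, and it assumes that `q` (keyword) and `l` (location) are the only query parameters needed, as in the example URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(keyword: str, location: str) -> str:
    """Return a search URL for the given keyword and location (hypothetical helper)."""
    # urlencode uses quote_plus by default, so spaces become "+" as in the example URL
    return f"{BASE_URL}?{urlencode({'q': keyword, 'l': location})}"


print(build_search_url("data scientist", "united states"))
# -> https://www.indeed.com/jobs?q=data+scientist&l=united+states
```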

### Part 1: Define a function or several functions to scrape the following information.
- **job title** (see (1) in Figure)
- **company** (see (2) in Figure)
- **location** (see (3) in Figure)
- **salary** (see (4) in Figure), if available
- **description** (see (1) in Figure). Note: to get the description, first extract the link embedded in each job card in the left panel, then open that link to retrieve the full description shown in the right panel. Hint: collect the other fields first, then fetch the full descriptions.
- `Output`: save all jobs as a DataFrame of columns (`title, company, location, salary, description`). E.g., for the given URL, you can get 15 jobs.

![alt text](Indeed.png "Indeed")
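
The two-step approach in the hint (card fields first, descriptions second) can be sketched independently of any browser automation. In the snippet below, `assemble_jobs`, `COLUMNS`, and the `fetch_description` callback are hypothetical names used for illustration; each card is assumed to be a plain dict that already holds the fields read from the left panel plus a `link`.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Column order requested in the task statement
COLUMNS = ["title", "company", "location", "salary", "description"]


def assemble_jobs(
    cards: Iterable[Dict[str, Optional[str]]],
    fetch_description: Callable[[str], Optional[str]],
) -> List[Dict[str, Optional[str]]]:
    """Second pass: follow each card's link and attach the description."""
    rows = []
    for card in cards:
        link = card.get("link")
        # Only follow a link when the first pass actually found one
        description = fetch_description(link) if link else None
        row = {name: card.get(name) for name in COLUMNS[:-1]}
        row["description"] = description
        rows.append(row)
    return rows
```

The resulting list of dicts can be passed to `pd.DataFrame(rows, columns=COLUMNS)` to obtain the requested table.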

In [1]:
import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support import expected_conditions 
from selenium.common.exceptions import NoSuchElementException

In [2]:
def getJobs(URL):
    delay:int = 2
    # Initialize a Selenium WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # Open the URL in the WebDriver
    driver.get(URL)

    # Create lists to store job details
    titles = []
    companies = []
    locations = []
    salaries = []
    links = []

    # Wait for the job listings to load (adjust wait time as needed)
    wait = WebDriverWait(driver, 40)
    wait.until(expected_conditions.presence_of_element_located((By.CLASS_NAME, "mosaic-zone")))

    # Get the job listings
    job_cards = driver.find_elements(By.CSS_SELECTOR, 'li[class="css-5lfssm eu4oa1w0"]')
    for job_card in job_cards:
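        try:
            # The first anchor in a card carries both the title and the job link
            anchor = job_card.find_element(By.CSS_SELECTOR, 'a')
            title = anchor.text
            job_link = anchor.get_attribute('href')
        except NoSuchElementException:
            # Reset both fields so a link from the previous card is not reused
            title = None
            job_link = None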
        try:
            company = job_card.find_element(By.CSS_SELECTOR, 'span.companyName').text
        except NoSuchElementException as e:
            company = None    
        try:
            location = job_card.find_element(By.CSS_SELECTOR, 'div[class="companyLocation"]').text
        except NoSuchElementException as e:
            location = None
        
        try:
            salary = job_card.find_element(By.CSS_SELECTOR, 'div[class="css-1ihavw2 eu4oa1w0"]').text
        except NoSuchElementException as e:
            salary = None


        links.append(job_link)
        titles.append(title)
        locations.append(location)
        salaries.append(salary)
        companies.append(company)

    # Create a DataFrame to store the job details
    jobs_df = pd.DataFrame({'title': titles, 'company':companies, 'location': locations, 'salary': salaries, 'link': links})
    
    time.sleep(delay)
    # Close the WebDriver
    driver.quit()

    return jobs_df
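
The repeated `try`/`except` blocks in `getJobs` all follow one pattern: look up a child element and fall back to `None` when it is missing. A small helper can express that once. This is only a sketch: `text_or_none` is a hypothetical name, and the `ImportError` fallback exists solely so the snippet runs where Selenium is not installed.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:  # fallback so the sketch runs without Selenium installed
    class NoSuchElementException(Exception):
        pass

CSS = "css selector"  # the string value behind selenium's By.CSS_SELECTOR


def text_or_none(parent, selector):
    """Return the text of the first element matching selector under parent, or None."""
    try:
        return parent.find_element(CSS, selector).text
    except NoSuchElementException:
        return None
```

With this in place, a lookup such as the company name reduces to `company = text_or_none(job_card, 'span.companyName')`.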

In [3]:
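from selenium.common.exceptions import TimeoutException

def get_full_description(links):
    """Visit each job link and return a list of descriptions (None where unavailable)."""
    # Initialize a list to store descriptions
    descriptions = []
    options = webdriver.ChromeOptions()
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.1000.0 Safari/537.36")
    # Initialize a Selenium WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    for job_link in links:
        # Skip cards for which no link was captured
        if not isinstance(job_link, str):
            descriptions.append(None)
            continue

        # Open the job link
        driver.get(job_link)
        try:
            # Wait for the job description to load (adjust wait time as needed)
            wait = WebDriverWait(driver, 10)
            wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR, 'div[id = "jobDescriptionTitle"]')))

            # Get the job description.
            # Note: find_element returns only the first <p> on the page, so this
            # captures the opening paragraph rather than the whole description.
            description = driver.find_element(By.CSS_SELECTOR, 'p').text
        except (TimeoutException, NoSuchElementException):
            # wait.until raises TimeoutException when the element never appears
            description = None
        descriptions.append(description)
        # Pause between requests
        time.sleep(5)

    # Close the WebDriver
    driver.quit()

    return descriptions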

In [4]:
# this is an example url, you can first get other variables
URL="https://www.indeed.com/jobs?q=data+scientist&l=united+states"
jobs=getJobs(URL)

In [5]:
jobs

| | title | company | location | salary | link |
|---|---|---|---|---|---|
| 0 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 1 | Senior Statistician (Remote) | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 2 | Principal Data Science | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 3 | Senior Fraud Data Analyst | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 4 | Sr Distinguished Engineer, Generative AI Syste... | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 5 | | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 6 | Data Engineer | | | | https://www.indeed.com/rc/clk?jk=9a1e3cf6ecc0a... |
| 7 | Data Scientist | | | | https://www.indeed.com/rc/clk?jk=d7c4f342616b5... |
| 8 | Data Scientist | | | | https://www.indeed.com/rc/clk?jk=0f854716ba48e... |
| 9 | Data Scientist | | | | https://www.indeed.com/rc/clk?jk=6a9e814f1444a... |


In [6]:
# and then get the full description from the links 
jobs["description"]=get_full_description(jobs["link"])

In [7]:
jobs.head()

| | title | company | location | salary | link | description |
|---|---|---|---|---|---|---|
| 0 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Data Science is a broad field at NSA and a tea... |
| 1 | Senior Statistician (Remote) | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Summary |
| 2 | Principal Data Science | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Discover. A brighter future. |
| 3 | Senior Fraud Data Analyst | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | |
| 4 | Sr Distinguished Engineer, Generative AI Syste... | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | |


### Part 2: Modify the function(s) defined in Part 1 to scrape all jobs on the first five pages.
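
The Part 2 code moves through result pages by adding a `start` offset that grows by 10 per page. A minimal sketch of that URL arithmetic, separated from the scraping itself, is shown here. The helper name `page_urls` and its `page_size` parameter are hypothetical, and the step of 10 is taken from this notebook's own code rather than from any documented Indeed behavior.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url, pages=5, page_size=10):
    """Return one URL per result page, setting the start offset for pages after the first."""
    parts = urlsplit(base_url)
    # Drop any existing start parameter so it is not duplicated
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "start"]
    urls = []
    for page in range(pages):
        params = list(query)
        if page > 0:
            params.append(("start", str(page * page_size)))
        urls.append(urlunsplit(parts._replace(query=urlencode(params))))
    return urls


for url in page_urls("https://www.indeed.com/jobs?q=data+scientist&l=united+states"):
    print(url)
```

Rebuilding the query with `urlencode` instead of concatenating `&start=...` keeps the URL valid even when the base URL already contains a `start` parameter.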

In [8]:
def getJobs(URL):
    # Initialize a Selenium WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # Create lists to store job details
    titles = []
    companies = []
    locations = []
    salaries = []
    links = []

    for page_number in range(1, 6):
        # Build the URL for the current page (the start offset advances by 10 per page)
        if page_number == 1:
            page_url = URL
        else:
            page_url = f"{URL}&start={10 * (page_number - 1)}"

        # Open the URL in the WebDriver
        driver.get(page_url)

        # Wait for the job listings to load (adjust wait time as needed)
        wait = WebDriverWait(driver, 40)
        wait.until(expected_conditions.presence_of_element_located((By.CLASS_NAME, "mosaic-zone")))

        # Get the job listings
        job_cards = driver.find_elements(By.CSS_SELECTOR, 'li[class="css-5lfssm eu4oa1w0"]')
        for job_card in job_cards:
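            try:
                # The first anchor in a card carries both the title and the job link
                anchor = job_card.find_element(By.CSS_SELECTOR, 'a')
                title = anchor.text
                job_link = anchor.get_attribute('href')
            except NoSuchElementException:
                # Reset both fields so a link from the previous card is not reused
                title = None
                job_link = None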
            try:
                company = job_card.find_element(By.CSS_SELECTOR, 'span.companyName').text
            except NoSuchElementException as e:
                company = None    
            try:
                location = job_card.find_element(By.CSS_SELECTOR, 'div[class="companyLocation"]').text
            except NoSuchElementException as e:
                location = None
        
            try:
                salary = job_card.find_element(By.CSS_SELECTOR, 'div[class="css-1ihavw2 eu4oa1w0"]').text
            except NoSuchElementException as e:
                salary = None
            
            links.append(job_link)
            titles.append(title)
            locations.append(location)
            salaries.append(salary)
            companies.append(company)

    # Create a DataFrame to store the job details
    jobs_df = pd.DataFrame({'title': titles, 'company':companies, 'location': locations, 'salary': salaries, 'link': links})
    
    # Close the WebDriver
    driver.quit()

    return jobs_df
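
When several pages are scraped, the same posting may appear more than once. The sketch below filters repeated links before the slower description pass. It rests on an assumption: that the `jk` query parameter visible in the `rc/clk` links identifies a posting. Links without `jk` are compared by their full text. The names `job_key` and `drop_repeated` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def job_key(link):
    """Return the jk query parameter of a link, or None if it has none."""
    if not link:
        return None
    values = parse_qs(urlsplit(link).query).get("jk")
    return values[0] if values else None


def drop_repeated(links):
    """Keep the first occurrence of each posting; missing links are skipped."""
    seen = set()
    kept = []
    for link in links:
        if not link:
            continue
        # Fall back to the full link when no jk parameter is present
        key = job_key(link) or link
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return kept
```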

In [9]:
jobs=getJobs(URL)

In [10]:
jobs.head()

| | title | company | location | salary | link |
|---|---|---|---|---|---|
| 0 | DIRECTOR - CAPITAL MODELING | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 1 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 2 | Reliability, Availability and Serviceability E... | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |
| 3 | Data Scientist | | | | https://www.indeed.com/rc/clk?jk=88dafd90c82e6... |
| 4 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... |


In [11]:
jobs["description"]=get_full_description(jobs["link"])

In [12]:
jobs.head()

| | title | company | location | salary | link | description |
|---|---|---|---|---|---|---|
| 0 | DIRECTOR - CAPITAL MODELING | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | |
| 1 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Data Science is a broad field at NSA and a tea... |
| 2 | Reliability, Availability and Serviceability E... | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | |
| 3 | Data Scientist | | | | https://www.indeed.com/rc/clk?jk=88dafd90c82e6... | Division Name: Finance and Planning\nDepartmen... |
| 4 | Data Scientist | | | | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | The Data Scientist will be responsible for min... |
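
Scraped descriptions can carry embedded newlines, as the `\n` in the row 3 description above shows. If single-line text is preferred for later analysis, whitespace can be collapsed with a small helper. This is an optional sketch: `normalize_whitespace` is a hypothetical name and the sample string is made up for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    if text is None:
        return None
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("First line\n\n  Second line "))
# -> First line Second line
```

Applied to the DataFrame, this would be `jobs["description"].map(normalize_whitespace)`, assuming missing descriptions are stored as `None`.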


In [14]:
jobs["description"][1]

"Data Science is a broad field at NSA and a team effort that spans all the expertise needed to derive value from data. As a Data Scientist, you will uses elements of mathematics, statistics, computer science, and application-specific knowledge to gather, make, and communicate principled conclusions from data. You will employ your mathematical science, computer science, and quantitative analysis skills to develop solutions to complex data problems and take full advantage of NSA's capabilities to tackle the highest priority challenges. Responsibilities include: - Exploring data analysis and model-fitting to reveal data features of interest - Using the machine-learned predictive modeling - Constructing usable data sets from multiple sources to meet customer needs - Identifying and analyzing anomalous data (including metadata) - Developing conceptual design and models to address mission requirements - Developing qualitative and quantitative methods for characterizing datasets in various st