# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [3]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

def scrape_linkedin_job_search(keywords):
    """Search job posts on LinkedIn and return the results as a dataframe."""
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies that this is a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Create a request to get the data from the server 
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    titles = []
    companies = []
    locations = []
    for card in soup.select("div.result-card__contents"):
        title = card.findChild("h3", recursive=False)
        company = card.findChild("h4", recursive=False)
        location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
        titles.append(title.string)
        companies.append(company.string)
        locations.append(location.string)
    
    # Build the dataframe from the collected lists.
    # (DataFrame.append was removed in pandas 2.0, so rows are no longer appended one by one.)
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations}, columns=columns)
    
    # Return dataframe
    return data

In [4]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analysis')
results

Unnamed: 0,Title,Company,Location


In [5]:
#!pip install selenium
from selenium import webdriver
import time

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()
driver.maximize_window()
# Open a web page
driver.get("https://www.linkedin.com")

jobs_button = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="guest_homepage-basic_guest_nav_menu_jobs"]')[0]
jobs_button.click()
time.sleep(1)

sign_in_button = driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="public_jobs_contextual-sign-in-modal_modal_dismiss"]')[0]
sign_in_button.click()
time.sleep(1)

country_input = driver.find_elements(by='xpath', value='//input[@id="job-search-bar-location"]')[0]
time.sleep(1)
country_input.clear()
country_input.send_keys("germany")
time.sleep(1)

job_input = driver.find_elements(by='xpath', value='//input[@data-tracking-control-name="public_jobs_dismissable-input"]')[0]
job_input.send_keys("python")
time.sleep(1)
job_list = driver.find_elements(by='xpath', value='//li[@id="keywords-1"]')[0]
job_list.click()

count = 0
while (not driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="infinite-scroller_show-more"]')) or count<=10:
    print(count)
    count += 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

job_posts = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="public_jobs_jserp-result_search-card"]')
time.sleep(1)

job_names = []
#company_names = []
job_links = []

for job_post in job_posts:
    if job_post.get_attribute('href'):
        job_links.append(job_post.get_attribute('href'))
    else:
        job_links.append(None)

    # find_element raises NoSuchElementException when nothing matches,
    # so use find_elements, which returns an empty list instead
    spans = job_post.find_elements(by='xpath', value='.//span')
    job_names.append(spans[0].text if spans else None)

    #if job_post.find_element(by='xpath', value='.//h4'):
    #    company_names.append(job_post.find_element(by='xpath', value='.//h4').text)
    #else:
    #    company_names.append(None)
    
time.sleep(2)
driver.quit()

0
1
2
3
4
5
6
7
8
9
10


In [6]:
import pandas as pd
pd.DataFrame({"Job Title": job_names, "Link": job_links})

Unnamed: 0,Job Title,Link
0,Senior Python Software Engineer,https://de.linkedin.com/jobs/view/senior-pytho...
1,AI Engineer (Intern),https://de.linkedin.com/jobs/view/ai-engineer-...
2,Internship - Software Development,https://de.linkedin.com/jobs/view/internship-s...
3,"Software Engineer III, Core Development, Python",https://de.linkedin.com/jobs/view/software-eng...
4,Software Developer,https://de.linkedin.com/jobs/view/software-dev...
...,...,...
75,2025-2026 AI/ML Research Fellow,https://de.linkedin.com/jobs/view/2025-2026-ai...
76,Strategy Analyst,https://de.linkedin.com/jobs/view/strategy-ana...
77,Junior Software Engineer,https://de.linkedin.com/jobs/view/junior-softw...
78,Graduate Software Engineer,https://de.linkedin.com/jobs/view/graduate-sof...


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search reaches the end early, skip the remaining requests and return the results collected so far.
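
The pagination loop described in the steps above could be sketched as follows. This is a minimal sketch rather than LinkedIn's documented API: the offset parameter name `start`, the page size of 25, and the helper names `build_page_urls` and `collect_pages` are assumptions for illustration, so confirm the real parameter by inspecting the URL as in step 2. The fetching step is passed in as a function, which lets the early-exit logic be tried without network access.

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://www.linkedin.com/jobs/search/?'
PAGE_SIZE = 25  # assumption: one page of results holds 25 jobs


def build_page_urls(keywords, num_pages):
    """Return one search URL per page, using a hypothetical `start` offset."""
    urls = []
    for page in range(num_pages):
        params = {'keywords': keywords, 'start': page * PAGE_SIZE}
        # quote_via=quote encodes spaces as %20 rather than '+'
        urls.append(BASE_URL + urlencode(params, quote_via=quote))
    return urls


def collect_pages(urls, fetch_page):
    """Call fetch_page(url) for each URL and stop at the first empty page.

    fetch_page is any function that takes a URL and returns a list of jobs.
    """
    results = []
    for url in urls:
        jobs = fetch_page(url)
        if not jobs:
            # Fewer pages than requested: stop instead of raising an error
            break
        results.extend(jobs)
    return results


# Stand-in for a real scraper: pretend only the first two pages have results
fake_results = {0: ['job A', 'job B'], 25: ['job C']}


def fake_fetch(url):
    offset = int(url.rsplit('start=', 1)[1])
    return fake_results.get(offset, [])


urls = build_page_urls('data analysis', 5)
jobs = collect_pages(urls, fake_fetch)
print(jobs)
```

In a real scraper, `fetch_page` would request the URL and parse the job cards; keeping it separate from the loop makes the "stop when a page is empty" rule easy to verify on its own.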

In [11]:
# your code here
#!pip install selenium
from selenium import webdriver
import time
import pandas as pd


def scrape_linkedin_job_search_2():

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome()
    driver.maximize_window()
    # Open a web page
    driver.get("https://www.linkedin.com")

    jobs_button = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="guest_homepage-basic_guest_nav_menu_jobs"]')[0]
    jobs_button.click()
    time.sleep(1)

    sign_in_button = driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="public_jobs_contextual-sign-in-modal_modal_dismiss"]')[0]
    sign_in_button.click()
    time.sleep(1)

    country_input = driver.find_elements(by='xpath', value='//input[@id="job-search-bar-location"]')[0]
    time.sleep(1)
    country_input.clear()
    country_input.send_keys("germany")
    time.sleep(1)

    job_input = driver.find_elements(by='xpath', value='//input[@data-tracking-control-name="public_jobs_dismissable-input"]')[0]
    job_input.send_keys("python")
    time.sleep(1)
    job_list = driver.find_elements(by='xpath', value='//li[@id="keywords-1"]')[0]
    job_list.click()

    count = 0
    # while loop to search for more than the initial set of jobs
    while (not driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="infinite-scroller_show-more"]')) or count<=10:
        
        count += 1
        # Scroll up slightly first so that LinkedIn triggers the next lazy-load batch
        driver.execute_script("window.scrollBy(0, -300);")

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    job_posts = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="public_jobs_jserp-result_search-card"]')
    time.sleep(1)

    job_names = []
    #company_names = []
    job_links = []

    for job_post in job_posts:
        if job_post.get_attribute('href'):
            job_links.append(job_post.get_attribute('href'))
        else:
            job_links.append(None)

        # find_element raises NoSuchElementException when nothing matches,
        # so use find_elements, which returns an empty list instead
        spans = job_post.find_elements(by='xpath', value='.//span')
        job_names.append(spans[0].text if spans else None)

        #if job_post.find_element(by='xpath', value='.//h4'):
        #    company_names.append(job_post.find_element(by='xpath', value='.//h4').text)
        #else:
        #    company_names.append(None)
        
    time.sleep(2)
    driver.quit()

    df = pd.DataFrame({"Job Title": job_names, "Link": job_links})
    return df



In [10]:
df = scrape_linkedin_job_search_2()

print(df)

0
1
2
3
4
5
6
7
8
9
10
                                            Job Title  \
0                     Senior Python Software Engineer   
1                                AI Engineer (Intern)   
2                   Internship - Software Development   
3     Software Engineer III, Core Development, Python   
4                                  Software Developer   
..                                                ...   
95                              AI Specialist (m/f/d)   
96  Backend Developer (m/w/d) – PHP / Web Applicat...   
97  Deutsche Bank Graduate Programme (f/m/x) in Be...   
98                       AI Platform Engineer (m/f/d)   
99  Python Developer for Desktop Applications (m/f/d)   

                                                 Link  
0   https://de.linkedin.com/jobs/view/senior-pytho...  
1   https://de.linkedin.com/jobs/view/ai-engineer-...  
2   https://de.linkedin.com/jobs/view/internship-s...  
3   https://de.linkedin.com/jobs/view/software-eng...  
4   https://

## Challenge 2

Further improve your function so that it can search jobs in a specific country. Add a 3rd param called `country` to your function. The steps are identical to those in Challenge 1.
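
One way to pass the country through the search URL is sketched below. The parameter name `location` and the helper `build_search_url` are assumptions for illustration; check the actual name by watching how the URL changes when you pick a country in the browser.

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://www.linkedin.com/jobs/search/?'


def build_search_url(keywords, country=None):
    """Assemble a search URL; `location` is an assumed parameter name."""
    params = {'keywords': keywords}
    if country:
        params['location'] = country
    # quote_via=quote encodes spaces as %20 rather than '+'
    return BASE_URL + urlencode(params, quote_via=quote)


url = build_search_url('data analysis', 'United Kingdom')
print(url)
```

Letting `urlencode` handle the escaping means callers can pass plain strings such as `'data analysis'` instead of pre-encoded ones such as `'data%20analysis'`.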

In [12]:
# your code here

from selenium import webdriver
import time
import pandas as pd


def scrape_linkedin_job_search_3(country):

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome()
    driver.maximize_window()
    # Open a web page
    driver.get("https://www.linkedin.com")

    jobs_button = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="guest_homepage-basic_guest_nav_menu_jobs"]')[0]
    jobs_button.click()
    time.sleep(1)

    sign_in_button = driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="public_jobs_contextual-sign-in-modal_modal_dismiss"]')[0]
    sign_in_button.click()
    time.sleep(1)

    country_input = driver.find_elements(by='xpath', value='//input[@id="job-search-bar-location"]')[0]
    time.sleep(1)
    country_input.clear()
    country_input.send_keys(country)
    time.sleep(1)

    job_input = driver.find_elements(by='xpath', value='//input[@data-tracking-control-name="public_jobs_dismissable-input"]')[0]
    job_input.send_keys("python")
    time.sleep(1)
    job_list = driver.find_elements(by='xpath', value='//li[@id="keywords-1"]')[0]
    job_list.click()

    count = 0
    # while loop to search for more than the initial set of jobs
    while (not driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="infinite-scroller_show-more"]')) or count<=10:
        
        count += 1
        # Scroll up slightly first so that LinkedIn triggers the next lazy-load batch
        driver.execute_script("window.scrollBy(0, -300);")

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    job_posts = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="public_jobs_jserp-result_search-card"]')
    time.sleep(1)

    job_names = []
    #company_names = []
    job_links = []

    for job_post in job_posts:
        if job_post.get_attribute('href'):
            job_links.append(job_post.get_attribute('href'))
        else:
            job_links.append(None)

        # find_element raises NoSuchElementException when nothing matches,
        # so use find_elements, which returns an empty list instead
        spans = job_post.find_elements(by='xpath', value='.//span')
        job_names.append(spans[0].text if spans else None)

        #if job_post.find_element(by='xpath', value='.//h4'):
        #    company_names.append(job_post.find_element(by='xpath', value='.//h4').text)
        #else:
        #    company_names.append(None)
        
    time.sleep(2)
    driver.quit()

    df = pd.DataFrame({"Job Title": job_names, "Link": job_links})
    return df

In [13]:
df = scrape_linkedin_job_search_3("England")

print(df)

                                     Job Title  \
0                     Junior Software Engineer   
1   Graduate Software Engineer 2025 - Platform   
2           Backend Software Engineer (Python)   
3                    Machine Learning Engineer   
4    Graduate Software Engineer 2025 - RegTech   
..                                         ...   
85                            Python Developer   
86                              Data Scientist   
87                  Graduate Software Engineer   
88                            Python Developer   
89                                 AI Engineer   

                                                 Link  
0   https://uk.linkedin.com/jobs/view/junior-softw...  
1   https://uk.linkedin.com/jobs/view/graduate-sof...  
2   https://uk.linkedin.com/jobs/view/backend-soft...  
3   https://uk.linkedin.com/jobs/view/machine-lear...  
4   https://uk.linkedin.com/jobs/view/graduate-sof...  
..                                                ...  
85  htt

## Challenge 3

Add the 4th param called `num_days` to your function to allow it to search jobs posted in the past X days. Note that in the LinkedIn job search the searched timespan is specified with the following param:

```
f_TPR=r259200
```

The numeric part of the param value is a number of seconds; 259,200 seconds equals 3 days. You need to convert `num_days` to a number of seconds and supply that value to the LinkedIn job search.
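
The conversion can be sketched as below. The helper names `time_filter_value` and `with_time_filter` are hypothetical; only the `f_TPR=r<seconds>` format comes from the text above. The second helper rebuilds the query string with `urllib.parse`, so it also works when the URL has no query string yet.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SECONDS_PER_DAY = 60 * 60 * 24


def time_filter_value(num_days):
    """Convert a number of days into the f_TPR value, e.g. 3 -> 'r259200'."""
    return 'r' + str(num_days * SECONDS_PER_DAY)


def with_time_filter(url, num_days):
    """Return url with f_TPR set, whether or not it already has a query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query['f_TPR'] = time_filter_value(num_days)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(time_filter_value(3))  # r259200
print(with_time_filter('https://www.linkedin.com/jobs/search/?keywords=python', 3))
```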

In [28]:
# your code here

from selenium import webdriver
import time
import pandas as pd


def scrape_linkedin_job_search_4(country, num_days):
    
    linkedin_timeparameter = 60*60*24*num_days
   
    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome()
    driver.maximize_window()
    # Open a web page
    driver.get("https://www.linkedin.com")

    # navigate to jobs
    jobs_button = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="guest_homepage-basic_guest_nav_menu_jobs"]')[0]
    jobs_button.click()
    time.sleep(1)

    # Close the sign-in pop-up by clicking its X button
    sign_in_button = driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="public_jobs_contextual-sign-in-modal_modal_dismiss"]')[0]
    sign_in_button.click()
    time.sleep(1)

    

    # enter Country
    country_input = driver.find_elements(by='xpath', value='//input[@id="job-search-bar-location"]')[0]
    time.sleep(1)
    country_input.clear()
    country_input.send_keys(country)
    time.sleep(1)

    job_input = driver.find_elements(by='xpath', value='//input[@data-tracking-control-name="public_jobs_dismissable-input"]')[0]
    job_input.send_keys("python")
    time.sleep(1)
    job_list = driver.find_elements(by='xpath', value='//li[@id="keywords-1"]')[0]
    job_list.click()

    # change Timespan in URL
    current_url = driver.current_url
    new_url = current_url + "&f_TPR=r" + str(linkedin_timeparameter)
    driver.get(new_url)


    count = 0
    # while loop to search for more than the initial set of jobs
    while (not driver.find_elements(by='xpath', value='//button[@data-tracking-control-name="infinite-scroller_show-more"]')) or count<=10:
        
        count += 1
        # Scroll up slightly first so that LinkedIn triggers the next lazy-load batch
        driver.execute_script("window.scrollBy(0, -300);")

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    job_posts = driver.find_elements(by='xpath', value='//a[@data-tracking-control-name="public_jobs_jserp-result_search-card"]')
    time.sleep(1)

    job_names = []
    #company_names = []
    job_links = []

    for job_post in job_posts:
        if job_post.get_attribute('href'):
            job_links.append(job_post.get_attribute('href'))
        else:
            job_links.append(None)

        # find_element raises NoSuchElementException when nothing matches,
        # so use find_elements, which returns an empty list instead
        spans = job_post.find_elements(by='xpath', value='.//span')
        job_names.append(spans[0].text if spans else None)

        #if job_post.find_element(by='xpath', value='.//h4'):
        #    company_names.append(job_post.find_element(by='xpath', value='.//h4').text)
        #else:
        #    company_names.append(None)
        
    time.sleep(2)
    driver.quit()

    df = pd.DataFrame({"Job Title": job_names, "Link": job_links})
    return df

In [27]:
df = scrape_linkedin_job_search_4("Germany", 5)

print(df)

                                            Job Title  \
0                                  Software Developer   
1                  Backend Software Engineer (Python)   
2                Data Scientist (all genders welcome)   
3                    Research Scientist | Language AI   
4   Graduate Software Developer - STEM / Physics /...   
..                                                ...   
95                              SRE / DevOps Engineer   
96  DevOps Engineer (Software-Oriented) - Luxembou...   
97  Postdoctoral researcher modeling and assessing...   
98                   (Senior) DevOps Engineer (f/m/d)   
99  IBM Z Software Entwickler (m/w/x) für Systems ...   

                                                 Link  
0   https://de.linkedin.com/jobs/view/software-dev...  
1   https://de.linkedin.com/jobs/view/backend-soft...  
2   https://de.linkedin.com/jobs/view/data-scienti...  
3   https://de.linkedin.com/jobs/view/research-sci...  
4   https://de.linkedin.com/jobs/vi

## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
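
As a starting point, the sketch below pulls a `currentJobId` value out of a URL's query string. It assumes the ID is available as a query parameter on a link inside the job card, which you should verify against the actual card HTML; the helper name `extract_current_job_id` and the example URL (including its ID) are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def extract_current_job_id(url):
    """Return the currentJobId query value from url, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get('currentJobId')
    return values[0] if values else None


# Hypothetical URL for illustration only; the ID is made up
example = 'https://www.linkedin.com/jobs/search/?currentJobId=1234567890&keywords=python'
job_id = extract_current_job_id(example)
print(job_id)
```

Returning `None` for a missing ID lets the calling loop skip that job's follow-up request and leave the Seniority Level empty for that row.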

In [None]:
# your code here