# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [39]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

def scrape_linkedin_job_search(keywords):
    """Search job posts on LinkedIn and return the results as a dataframe."""

    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Send a GET request to the server and parse the returned HTML
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Select the list of job "cards" from the parsed HTML.
    # Then, from each job card, extract the job title, company, and location.
    records = []
    for card in soup.select("div.job-search-card"):
        title = card.select("h3", recursive=False)[0].get_text().strip()
        company = card.select("h4", recursive=False)[0].get_text().strip()
        location = card.select("span.job-search-card__location")[0].get_text().strip()

        records.append({
            "Title": title,
            "Company": company,
            "Location": location
        })

    # Return dataframe
    return pd.DataFrame(records, columns=["Title", "Company", "Location"])

In [40]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analysis')
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Research Data Analyst | UCLA | Los Angeles, CA |
| 1 | Data Analyst | Petroplan | Houston, TX |
| 2 | Data Scientist | The Value Maximizer | New Jersey, United States |
| 3 | Data Scientist | The Value Maximizer | Pennsylvania, United States |
| 4 | Business Analyst | Nestlé | Solon, OH |
| 5 | Data Science Intern (Winter or Summer 2026) | Notion | New York, NY |
| 6 | Trading Analyst | Cantor Fitzgerald | New York City Metropolitan Area |
| 7 | Data Analyst I | City National Bank | Los Angeles, CA |
| 8 | Data Science Intern (Winter or Summer 2026) | Notion | San Francisco, CA |
| 9 | Data Engineer - Entry Level | InterWorks | Portland, OR |
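
The example call passes its keywords already percent-encoded (`data%20analysis`). As a minimal sketch, the standard library's `urllib.parse.urlencode` can build the query string from plain text instead. The helper name `build_search_url` and its `extra_params` argument are hypothetical and are not part of the lab code.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/?'


def build_search_url(keywords, **extra_params):
    """Hypothetical helper: assemble a search URL from unencoded values."""
    params = {'keywords': keywords, **extra_params}
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+')
    return BASE_URL + urlencode(params, quote_via=quote)


print(build_search_url('data analysis'))
# https://www.linkedin.com/jobs/search/?keywords=data%20analysis
```

Any further query parameters can be passed as keyword arguments, so the same helper keeps working as the function grows in the challenges.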


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search has already reached the end, simply skip any further requests and return the results collected so far.
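
The early-stop behaviour described in the hint can be sketched without touching the network. In this sketch, `fetch_page` is a hypothetical callable standing in for "request one page and parse its job cards", and `fake_fetch` simulates a search that has only 60 results. Neither is part of the lab code, and the page size of 25 is taken from the lab text.

```python
PAGE_SIZE = 25  # results per page, as stated in the lab text


def collect_pages(fetch_page, num_pages):
    """Call fetch_page(offset) for each page and stop at the first empty page."""
    records = []
    for page_index in range(num_pages):
        page_records = fetch_page(page_index * PAGE_SIZE)
        if not page_records:
            # The search has run out of results: skip the remaining pages
            break
        records.extend(page_records)
    return records


def fake_fetch(offset, total=60):
    """Hypothetical stand-in for a real request: pretends only 60 jobs exist."""
    return [{"Title": f"Job {i}"} for i in range(offset, min(offset + PAGE_SIZE, total))]


jobs = collect_pages(fake_fetch, 5)
print(len(jobs))  # 60
```

Five pages are requested, but the fourth comes back empty, so the loop ends there and the 60 collected records are returned without an error.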

In [45]:
def scrape_linkedin_job_search(keywords, num_pages):
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    records = []
    for page_num in range(num_pages):
        # Assemble the full url with parameters; the start offset advances by 25 per page
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&start=", str(page_num * 25)])

        try:
            # Send a GET request to the server and parse the returned HTML
            response = requests.get(scrape_url)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Select the list of job "cards" and stop early once a page has none
            cards = soup.select("div.job-search-card")
            if not cards:
                break

            # From each job card, extract the job title, company, and location.
            for card in cards:
                title = card.select("h3", recursive=False)[0].get_text().strip()
                company = card.select("h4", recursive=False)[0].get_text().strip()
                location = card.select("span.job-search-card__location")[0].get_text().strip()

                records.append({
                    "Title": title,
                    "Company": company,
                    "Location": location
                })
        except (requests.RequestException, IndexError):
            # On a failed request or an unexpected card layout, keep what was collected
            break

    # Return dataframe
    return pd.DataFrame(records, columns=["Title", "Company", "Location"])

In [46]:
results = scrape_linkedin_job_search('data%20analysis', 5)
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Research Data Analyst | UCLA | Los Angeles, CA |
| 1 | Data Analyst | Petroplan | Houston, TX |
| 2 | Data Scientist | The Value Maximizer | New Jersey, United States |
| 3 | Data Scientist | The Value Maximizer | Pennsylvania, United States |
| 4 | Data Science Intern (Winter or Summer 2026) | Notion | New York, NY |
| ... | ... | ... | ... |
| 295 | Sr. Data Analyst | Verse Medical | New York, NY |
| 296 | Data Scientist | Visa | Foster City, CA |
| 297 | Data Analyst | World Wide Technology | Georgia, United States |
| 298 | Data Analyst | World Wide Technology | Maryland, United States |


## Challenge 2

Further improve your function so that it can search for jobs in a specific country. Add a third parameter called `country` to your function. The steps are identical to those in Challenge 1.

In [47]:
def scrape_linkedin_job_search(keywords, num_pages, country):
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    records = []
    for page_num in range(num_pages):
        # Assemble the full url with parameters; the start offset advances by 25 per page
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&location=", country, "&start=", str(page_num * 25)])

        try:
            # Send a GET request to the server and parse the returned HTML
            response = requests.get(scrape_url)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Select the list of job "cards" and stop early once a page has none
            cards = soup.select("div.job-search-card")
            if not cards:
                break

            # From each job card, extract the job title, company, and location.
            for card in cards:
                title = card.select("h3", recursive=False)[0].get_text().strip()
                company = card.select("h4", recursive=False)[0].get_text().strip()
                location = card.select("span.job-search-card__location")[0].get_text().strip()

                records.append({
                    "Title": title,
                    "Company": company,
                    "Location": location
                })
        except (requests.RequestException, IndexError):
            # On a failed request or an unexpected card layout, keep what was collected
            break

    # Return dataframe
    return pd.DataFrame(records, columns=["Title", "Company", "Location"])

In [50]:
results = scrape_linkedin_job_search('data%20analysis', 5, "Germany")
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Equity Research | Institute of Finance and Economics (IFE) | Germany |
| 1 | Data Analyst | Happy Mammoth | Berlin, Berlin, Germany |
| 2 | Product Analyst - (Logistics, Customer) | Delivery Hero | Berlin, Berlin, Germany |
| 3 | Data Analyst (m/f/d) | TOPdesk | Kaiserslautern, Rhineland-Palatinate, Germany |
| 4 | Strategic researcher (m/f/d) | Omio | Berlin, Berlin, Germany |
| ... | ... | ... | ... |
| 275 | Director of Artificial Intelligence | AI Futures | Cologne Bonn Region |
| 276 | Business Analyst (m/w/d) | Swell, Inc. | Bad Schwalbach, Hesse, Germany |
| 277 | (Junior) Project & Data Analyst - LSCM (m/f/d) | Bridgestone EMEA | Frankfurt am Main, Hesse, Germany |
| 278 | Investment Analyst \| Single Family Office | THRONSBERG \| Private Capital Recruitment | Hamburg, Germany |


## Challenge 3

Add a fourth parameter called `num_days` to your function so that it can search for jobs posted in the past `num_days` days. Note that in the LinkedIn job search the time span is specified with the following parameter:

```
f_TPR=r259200
```

The numeric part of the parameter value is a number of seconds: 259,200 seconds equals 3 days. You need to convert `num_days` to seconds and supply that value to the LinkedIn job search.
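
The conversion is a single multiplication, since one day has 24 × 60 × 60 = 86,400 seconds. A minimal sketch follows; the helper name `days_to_f_tpr` is hypothetical and only illustrates the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def days_to_f_tpr(num_days):
    """Hypothetical helper: format a number of days as an f_TPR query parameter."""
    if num_days <= 0:
        raise ValueError("num_days must be a positive number of days")
    return f"f_TPR=r{num_days * SECONDS_PER_DAY}"


print(days_to_f_tpr(3))  # f_TPR=r259200
```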

In [52]:
def scrape_linkedin_job_search(keywords, num_pages, country, num_days):
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    records = []
    for page_num in range(num_pages):
        # Assemble the full url with parameters; num_days is converted to seconds (86400 per day)
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&location=", country, "&f_TPR=r", str(num_days * 86400), "&start=", str(page_num * 25)])

        try:
            # Send a GET request to the server and parse the returned HTML
            response = requests.get(scrape_url)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Select the list of job "cards" and stop early once a page has none
            cards = soup.select("div.job-search-card")
            if not cards:
                break

            # From each job card, extract the job title, company, and location.
            for card in cards:
                title = card.select("h3", recursive=False)[0].get_text().strip()
                company = card.select("h4", recursive=False)[0].get_text().strip()
                location = card.select("span.job-search-card__location")[0].get_text().strip()

                records.append({
                    "Title": title,
                    "Company": company,
                    "Location": location
                })
        except (requests.RequestException, IndexError):
            # On a failed request or an unexpected card layout, keep what was collected
            break

    # Return dataframe
    return pd.DataFrame(records, columns=["Title", "Company", "Location"])

In [53]:
results = scrape_linkedin_job_search('data%20analysis', 2, "Germany", 2)
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Equity Research | Institute of Finance and Economics (IFE) | Germany |
| 1 | Product Analyst - (Logistics, Customer) | Delivery Hero | Berlin, Berlin, Germany |
| 2 | Strategic researcher (m/f/d) | Omio | Berlin, Berlin, Germany |
| 3 | Equity Research Intern | Institute of Finance and Economics (IFE) | Germany |
| 4 | Marketing Data Analyst (m/f/d) | Pflegia | Berlin, Berlin, Germany |
| ... | ... | ... | ... |
| 109 | AI Innovator | CEF.AI | Munich, Bavaria, Germany |
| 110 | Praktikum Strategy und M&A (m/w/d) | Flip | Stuttgart, Baden-Württemberg, Germany |
| 111 | Prozessingenieur (all gender) | ALTEN Germany | Greater Munich Metropolitan Area |
| 112 | Senior Research Engineer | DeepRec.ai | Greater Munich Metropolitan Area |


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
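
One way to get from a job card to its detail page is sketched here, assuming the card's `data-entity-urn` attribute holds a colon-separated URN whose last segment is the job ID. The sample URN value is made up for illustration, and the helper names `job_id_from_urn` and `job_view_url` are hypothetical.

```python
JOB_VIEW_URL = 'https://www.linkedin.com/jobs/view/'


def job_id_from_urn(urn):
    """Return the last colon-separated segment of a URN string."""
    return urn.rsplit(":", 1)[-1]


def job_view_url(job_id):
    """Build the URL of a single job posting from its ID."""
    return JOB_VIEW_URL + job_id


# Made-up URN; on a real card the value would come from card["data-entity-urn"]
sample_urn = "urn:li:jobPosting:1234567890"
print(job_view_url(job_id_from_urn(sample_urn)))
# https://www.linkedin.com/jobs/view/1234567890
```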

In [119]:
def scrape_linkedin_job_search(keywords, num_pages, country, num_days):
    # Define the base urls to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    JOB_URL = 'https://www.linkedin.com/jobs/view/'

    records = []
    for page_num in range(num_pages):
        # Assemble the full url with parameters; num_days is converted to seconds (86400 per day)
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&location=", country, "&f_TPR=r", str(num_days * 86400), "&start=", str(page_num * 25)])

        try:
            # Send a GET request to the server and parse the returned HTML
            response = requests.get(scrape_url)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Select the list of job "cards" and stop early once a page has none
            cards = soup.select("div.job-search-card")
            if not cards:
                break

            # From each job card, extract the job title, company, and location.
            for card in cards:
                title = card.select("h3", recursive=False)[0].get_text().strip()
                company = card.select("h4", recursive=False)[0].get_text().strip()
                location = card.select("span.job-search-card__location")[0].get_text().strip()

                # The job ID is the last segment of the card's data-entity-urn attribute
                current_job_id = card["data-entity-urn"].split(":")[-1]

                # Request the job's own page and read the seniority level from its criteria list
                job_response = requests.get(''.join([JOB_URL, current_job_id]))
                job_soup = BeautifulSoup(job_response.text, 'html.parser')
                seniority = job_soup.select("li.description__job-criteria-item h3")[0].parent.select("span")[0].get_text().strip()

                records.append({
                    "Title": title,
                    "Company": company,
                    "Location": location,
                    "Seniority level": seniority
                })
        except (requests.RequestException, IndexError, KeyError):
            # On a failed request or an unexpected page layout, keep what was collected
            break

    # Return dataframe
    return pd.DataFrame(records, columns=["Title", "Company", "Location", "Seniority level"])

In [117]:
results = scrape_linkedin_job_search('data%20analysis', 2, "Germany", 2)
results

| | Title | Company | Location | Seniority level |
|---|---|---|---|---|
| 0 | Equity Research | Institute of Finance and Economics (IFE) | Germany | Internship |
| 1 | Product Analyst - (Logistics, Customer) | Delivery Hero | Berlin, Berlin, Germany | Mid-Senior level |
| 2 | Strategic researcher (m/f/d) | Omio | Berlin, Berlin, Germany | Internship |
| 3 | Equity Research Intern | Institute of Finance and Economics (IFE) | Germany | Internship |
| 4 | Marketing Data Analyst (m/f/d) | Pflegia | Berlin, Berlin, Germany | Associate |
| ... | ... | ... | ... | ... |
| 109 | AI Innovator | CEF.AI | Munich, Bavaria, Germany | Associate |
| 110 | Praktikum Strategy und M&A (m/w/d) | Flip | Stuttgart, Baden-Württemberg, Germany | Not Applicable |
| 111 | Prozessingenieur (all gender) | ALTEN Germany | Greater Munich Metropolitan Area | Associate |
| 112 | Senior Research Engineer | DeepRec.ai | Greater Munich Metropolitan Area | Mid-Senior level |
