# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search page. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library documentation](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re (regular expressions)](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [7]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import urllib.parse
from tqdm import tqdm 


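def scrape_linkedin_job_search(keywords, max_pages=15, amount=200):
    """Scrape LinkedIn job cards for a URL-encoded `keywords` string.

    Loads up to `max_pages` batches of results and stops once `amount`
    jobs have been collected. Returns a DataFrame with one row per job.
    """
    driver = webdriver.Chrome()
    url = f'https://www.linkedin.com/jobs/search/?keywords={keywords}'
    driver.get(url)

    # Optional: dismiss the modal popup. Uncomment if the popup blocks the page.
    # try:
    #     WebDriverWait(driver, 10).until(
    #         EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='Dismiss']"))
    #     ).click()
    #     print(" ▸ Modal window closed")
    # except Exception:
    #     print(" ▸ Window not found")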
    collected_urns = set()
    data = []

    try:
        for page in range(max_pages):
            # Dynamic wait for element updates
            WebDriverWait(driver, 15).until(
                lambda d: len(d.find_elements(By.CSS_SELECTOR, "div.base-card.relative")) > len(collected_urns)*0.8
            )
            
            # Parse page
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            jobs = soup.select('div.base-card.relative')
            
            new_jobs = 0
            for job in jobs:
                try:
                    urn = job.get('data-entity-urn', '').split(':')[-1]
                    if urn and urn not in collected_urns:
                        # Extract link
                        link_tag = job.select_one('a.base-card__full-link')
                        link = link_tag['href'].split('?')[0] if link_tag else 'N/A'
                        
                        data.append({
                            'urn': urn,
                            'title': job.select_one('span.sr-only').get_text(strip=True),
                            'company': job.select_one('h4.base-search-card__subtitle a').get_text(strip=True),
                            'location': job.select_one('span.job-search-card__location').get_text(strip=True),
                            'link': link
                        })
                        collected_urns.add(urn)
                        new_jobs += 1
                    # Early exit if limit reached
                    if len(data) >= amount:
                        break
                except Exception:
                    # Skip cards with missing fields (e.g. select_one returned None)
                    continue

            print(f" ▸ Page {page+1}: Added {new_jobs} jobs | Total: {len(data)}")

            # scrolling
            driver.execute_script("""
                window.scrollTo({
                    top: document.body.scrollHeight - 500,
                    behavior: 'smooth'
                });
            """)
            
            # Enhanced "See more" button search
            try:
                more_btn = WebDriverWait(driver, 5).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'See more jobs')]"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", more_btn)
                driver.execute_script("arguments[0].click();", more_btn)
                time.sleep(random.uniform(2.5, 4.0))
            except Exception:
                # Alternative scrolling if button not found
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(1.5)

            # Early exit if limit reached
            if len(data) >= amount:
                break

    finally:
        driver.quit()

    df = pd.DataFrame(data).drop_duplicates('urn')
    print(f"\nFinal result: {len(df)} jobs")
    return df

In [46]:
# Execution
df = scrape_linkedin_job_search('data%20scientist')
df.head()

 ▸ Modal window closed
 ▸ Page 1: Added 60 jobs | Total: 60
 ▸ Page 2: Added 0 jobs | Total: 60
 ▸ Page 3: Added 20 jobs | Total: 80
 ▸ Page 4: Added 20 jobs | Total: 100
 ▸ Page 5: Added 10 jobs | Total: 110
 ▸ Page 6: Added 10 jobs | Total: 120
 ▸ Page 7: Added 10 jobs | Total: 130
 ▸ Page 8: Added 10 jobs | Total: 140
 ▸ Page 9: Added 10 jobs | Total: 150
 ▸ Page 10: Added 10 jobs | Total: 160
 ▸ Page 11: Added 10 jobs | Total: 170
 ▸ Page 12: Added 10 jobs | Total: 180
 ▸ Page 13: Added 10 jobs | Total: 190
 ▸ Page 14: Added 10 jobs | Total: 200

Final result: 200 jobs


| | urn | title | company | location | link |
|---|---|---|---|---|---|
| 0 | 4208340911 | Data Scientist (L5) - App QoE | Netflix | United States | https://www.linkedin.com/jobs/view/data-scient... |
| 1 | 4211733043 | Data Scientist I | Pinterest | United States | https://www.linkedin.com/jobs/view/data-scient... |
| 2 | 4208158460 | Data Scientist, Product Analytics | Meta | Chicago, IL | https://www.linkedin.com/jobs/view/data-scient... |
| 3 | 4208950495 | Data Scientist, Product, Sustainability | Google | San Francisco, CA | https://www.linkedin.com/jobs/view/data-scient... |
| 4 | 4211646000 | Data Scientist (L5) - Netflix Preview Club | Netflix | United States | https://www.linkedin.com/jobs/view/data-scient... |


## Challenge 1

Your first challenge is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`, so that the function can retrieve more than the 25 jobs shown on a single results page. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search reaches the end early, skip the remaining requests and return the results collected so far.
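
One way to approach the steps above is sketched below. It assumes the offset parameter is named `start` and advances by 25 per page; verify both against the URL you observe in step 2. `build_page_urls`, `collect_pages`, and the `fetch_page` callable are hypothetical names, with `fetch_page` standing in for whatever code loads and parses a single page.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.linkedin.com/jobs/search/"
PAGE_SIZE = 25  # assumption: 25 results per page


def build_page_urls(keywords, num_pages, page_size=PAGE_SIZE):
    """Return one search URL per page, using an assumed `start` offset."""
    urls = []
    for page in range(num_pages):
        params = {"keywords": keywords, "start": page * page_size}
        # quote_via=quote encodes spaces as %20 instead of +
        urls.append(f"{BASE_URL}?{urlencode(params, quote_via=quote)}")
    return urls


def collect_pages(urls, fetch_page):
    """Fetch pages in order and stop at the first empty page."""
    results = []
    for url in urls:
        jobs = fetch_page(url)  # hypothetical callable returning a list of jobs
        if not jobs:
            break  # end of results reached: return what we have
        results.extend(jobs)
    return results
```

Stopping at the first empty page is what keeps the function from raising errors when a search has fewer pages than requested.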

In [56]:
# The LinkedIn job search script uses Selenium to scrape job postings. When run without authentication, it cannot paginate through results by page number.

from IPython.display import display, HTML


# Function for clickable links
def link_formatter(url):
    return f'<a href="{url}" target="_blank" style="color: blue; text-decoration: underline;">{url}</a>'

# Apply styling to combined DataFrame
styled = df.sample(10).sort_index().style.format({'link': link_formatter})

display(HTML(styled.to_html(escape=False)))

| | urn | title | company | location | link |
|---|---|---|---|---|---|
| 20 | 4209157110 | Research Scientist | IBM | Yorktown Heights, NY | https://www.linkedin.com/jobs/view/research-scientist-at-ibm-4209157110 |
| 34 | 4215143631 | Data Scientist, Analytics - Tiktok | TikTok | San Jose, CA | https://www.linkedin.com/jobs/view/data-scientist-analytics-tiktok-at-tiktok-4215143631 |
| 79 | 4215961727 | Data Scientist Intern | Triple Ring Technologies | Boston, MA | https://www.linkedin.com/jobs/view/data-scientist-intern-at-triple-ring-technologies-4215961727 |
| 83 | 4217622572 | Senior Data Scientist, Growth and Acquisition | Cash App | United States | https://www.linkedin.com/jobs/view/senior-data-scientist-growth-and-acquisition-at-cash-app-4217622572 |
| 104 | 4216720911 | Data Scientist II | Ascendion | Santa Cruz, CA | https://www.linkedin.com/jobs/view/data-scientist-ii-at-ascendion-4216720911 |
| 149 | 4211684121 | Data Scientist | Johnson & Johnson MedTech | Santa Clara, CA | https://www.linkedin.com/jobs/view/data-scientist-at-johnson-johnson-medtech-4211684121 |
| 150 | 4207708801 | Senior Scientist, Translational Research | Bristol Myers Squibb | Cambridge, MA | https://www.linkedin.com/jobs/view/senior-scientist-translational-research-at-bristol-myers-squibb-4207708801 |
| 162 | 4217605109 | Data Science Specialist | Techgene Solutions | Texas, United States | https://www.linkedin.com/jobs/view/data-science-specialist-at-techgene-solutions-4217605109 |
| 164 | 4210156840 | Machine Learning Engineer | mbue | Austin, Texas Metropolitan Area | https://www.linkedin.com/jobs/view/machine-learning-engineer-at-mbue-4210156840 |
| 178 | 4218626584 | Machine Learning Scientist | Strativ Group | United States | https://www.linkedin.com/jobs/view/machine-learning-scientist-at-strativ-group-4218626584 |


## Challenge 2

Further improve your function so that it can search for jobs in a specific country. Add a third parameter called `country` to your function. The steps are the same as in Challenge 1.
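
A minimal sketch of how a country could be added to the search URL. It assumes the search page accepts `location` and, optionally, `geoId` as query parameters; confirm the exact names by changing the location in your browser and reading the URL. `build_country_url` and its `geo_id` argument are hypothetical names.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.linkedin.com/jobs/search/"


def build_country_url(keywords, country, geo_id=None):
    """Build a search URL restricted to `country` (parameter names assumed)."""
    params = {"keywords": keywords, "location": country}
    if geo_id is not None:
        params["geoId"] = geo_id  # assumed numeric identifier for the region
    # urlencode escapes each value, so callers pass plain text like "data scientist"
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(build_country_url("data scientist", "Germany"))
```

Building the query from a dictionary keeps each parameter separate, so the keywords never need to carry extra `&key=value` pairs.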

In [57]:
# your code here
# Search for "Data Scientist" jobs in Germany
df = scrape_linkedin_job_search('data%20scientist&location=Germany&geoId=101282230')
styled = df.sample(10).sort_index().style.format({'link': link_formatter})
display(HTML(styled.to_html(escape=False)))

 ▸ Window not found
 ▸ Page 1: Added 60 jobs | Total: 60
 ▸ Page 2: Added 0 jobs | Total: 60
 ▸ Page 3: Added 20 jobs | Total: 80
 ▸ Page 4: Added 20 jobs | Total: 100
 ▸ Page 5: Added 10 jobs | Total: 110
 ▸ Page 6: Added 10 jobs | Total: 120
 ▸ Page 7: Added 10 jobs | Total: 130
 ▸ Page 8: Added 10 jobs | Total: 140
 ▸ Page 9: Added 10 jobs | Total: 150
 ▸ Page 10: Added 10 jobs | Total: 160
 ▸ Page 11: Added 10 jobs | Total: 170
 ▸ Page 12: Added 10 jobs | Total: 180
 ▸ Page 13: Added 10 jobs | Total: 190
 ▸ Page 14: Added 10 jobs | Total: 200

Final result: 200 jobs


| | urn | title | company | location | link |
|---|---|---|---|---|---|
| 0 | 4202669828 | FrontendDeveloper React | Instaffo | Kiel, Schleswig-Holstein, Germany | https://de.linkedin.com/jobs/view/frontenddeveloper-react-at-instaffo-4202669828 |
| 34 | 4207840901 | Data Scientist d/f/m | RWE | Essen, North Rhine-Westphalia, Germany | https://de.linkedin.com/jobs/view/data-scientist-d-f-m-at-rwe-4207840901 |
| 39 | 3996395665 | Machine Learning Engineer (m/w/d) | natif.ai | Saarbrücken, Saarland, Germany | https://de.linkedin.com/jobs/view/machine-learning-engineer-m-w-d-at-natif-ai-3996395665 |
| 41 | 4192452639 | Machine Learning Engineer | Almedia | Berlin, Germany | https://de.linkedin.com/jobs/view/machine-learning-engineer-at-almedia-4192452639 |
| 76 | 4181100972 | Data Scientist (m/f/x) | sevdesk | Berlin, Berlin, Germany | https://de.linkedin.com/jobs/view/data-scientist-m-f-x-at-sevdesk-4181100972 |
| 108 | 4200346042 | AI/LLM Engineer | Briink | Berlin, Berlin, Germany | https://de.linkedin.com/jobs/view/ai-llm-engineer-at-briink-4200346042 |
| 182 | 4011979245 | Data Scientist (w/m/d) | Capgemini | Hamburg, Hamburg, Germany | https://de.linkedin.com/jobs/view/data-scientist-w-m-d-at-capgemini-4011979245 |
| 184 | 4201024363 | AI/LLM Engineer (m/w/d) | T-Systems International | Munich, Bavaria, Germany | https://de.linkedin.com/jobs/view/ai-llm-engineer-m-w-d-at-t-systems-international-4201024363 |
| 186 | 4192445022 | Senior Data Scientist (m/w/d) - Eviden | Eviden | Frankfurt am Main, Hesse, Germany | https://de.linkedin.com/jobs/view/senior-data-scientist-m-w-d-eviden-at-eviden-4192445022 |
| 190 | 4214977276 | (Senior) Data Scientist (Analytics), Merchant | Wolt | Berlin, Berlin, Germany | https://de.linkedin.com/jobs/view/senior-data-scientist-analytics-merchant-at-wolt-4214977276 |


## Challenge 3

Add a fourth parameter called `num_days` to your function so that it can search for jobs posted in the past X days. In the LinkedIn job search, the time span is specified with the following parameter:

```
f_TPR=r259200
```

The number in the parameter value is a number of seconds; 259,200 seconds equals 3 days. Convert `num_days` to seconds and pass that value to the LinkedIn job search.
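
The conversion is a single multiplication, sketched here with a hypothetical helper named `time_posted_param`. The `f_TPR=r<seconds>` format is taken from the example above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def time_posted_param(num_days):
    """Return the f_TPR query fragment for jobs posted in the last `num_days` days."""
    if num_days <= 0:
        raise ValueError("num_days must be a positive integer")
    return f"f_TPR=r{num_days * SECONDS_PER_DAY}"


print(time_posted_param(3))  # f_TPR=r259200
```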

In [58]:
# your code here
df = scrape_linkedin_job_search('data%20scientist&location=Germany&geoId=101282230&f_TPR=r259200')
styled = df.sample(10).sort_index().style.format({'link': link_formatter})
display(HTML(styled.to_html(escape=False)))

 ▸ Window not found
 ▸ Page 1: Added 60 jobs | Total: 60
 ▸ Page 2: Added 0 jobs | Total: 60
 ▸ Page 3: Added 20 jobs | Total: 80
 ▸ Page 4: Added 20 jobs | Total: 100
 ▸ Page 5: Added 10 jobs | Total: 110
 ▸ Page 6: Added 9 jobs | Total: 119
 ▸ Page 7: Added 9 jobs | Total: 128
 ▸ Page 8: Added 10 jobs | Total: 138
 ▸ Page 9: Added 10 jobs | Total: 148
 ▸ Page 10: Added 10 jobs | Total: 158
 ▸ Page 11: Added 10 jobs | Total: 168
 ▸ Page 12: Added 10 jobs | Total: 178
 ▸ Page 13: Added 10 jobs | Total: 188
 ▸ Page 14: Added 10 jobs | Total: 198
 ▸ Page 15: Added 10 jobs | Total: 208

Final result: 208 jobs


| | urn | title | company | location | link |
|---|---|---|---|---|---|
| 54 | 4200136991 | Senior Data Scientist / Data Engineer Capital Markets m/w/d | DZ BANK AG | Frankfurt am Main, Hesse, Germany | https://de.linkedin.com/jobs/view/senior-data-scientist-data-engineer-capital-markets-m-w-d-at-dz-bank-ag-4200136991 |
| 60 | 4215506883 | Data Scientist Arabic NLP | Fraunhofer IAIS | Sankt Augustin, North Rhine-Westphalia, Germany | https://de.linkedin.com/jobs/view/data-scientist-arabic-nlp-at-fraunhofer-iais-4215506883 |
| 69 | 4216769772 | Data Scientist AIOPs (m/w/d) | Telefónica Germany | Greater Munich Metropolitan Area | https://de.linkedin.com/jobs/view/data-scientist-aiops-m-w-d-at-telef%C3%B3nica-germany-4216769772 |
| 83 | 4217615813 | Junior GenAI Entwickler:in, remote (m/w/d) | dreifach.ai | Cologne, North Rhine-Westphalia, Germany | https://de.linkedin.com/jobs/view/junior-genai-entwickler-in-remote-m-w-d-at-dreifach-ai-4217615813 |
| 91 | 4214287238 | Hackers wanted – Software Engineers | freiheit.com technologies | Hamburg, Germany | https://de.linkedin.com/jobs/view/hackers-wanted-%E2%80%93-software-engineers-at-freiheit-com-technologies-4214287238 |
| 93 | 4217421131 | HIL Resident Engineer in Computer Vision | Technology & Strategy | Munich, Bavaria, Germany | https://de.linkedin.com/jobs/view/hil-resident-engineer-in-computer-vision-at-technology-strategy-4217421131 |
| 146 | 4158617515 | Senior Fullstack Engineer | Huspy | Germany | https://de.linkedin.com/jobs/view/senior-fullstack-engineer-at-huspy-4158617515 |
| 198 | 4217290088 | Staff AI Engineer (m/f/d) | IU International University of Applied Sciences | Hamburg, Hamburg, Germany | https://de.linkedin.com/jobs/view/staff-ai-engineer-m-f-d-at-iu-international-university-of-applied-sciences-4217290088 |
| 199 | 4217252759 | Werkstudium / Praxissemester Web Development | FarmAct | Augsburg, Bavaria, Germany | https://de.linkedin.com/jobs/view/werkstudium-praxissemester-web-development-at-farmact-4217252759 |
| 200 | 4214968882 | Frontend Engineer, German-Speaking (C1/C2) | Muxon | Neu Ulm, Bavaria, Germany | https://de.linkedin.com/jobs/view/frontend-engineer-german-speaking-c1-c2-at-muxon-4214968882 |


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
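
Each per-job request needs a job ID. The sketch below assumes the card's `data-entity-urn` attribute ends with the numeric ID after the last colon, for example `urn:li:jobPosting:<id>`; that format is an assumption to check against the HTML you scrape. `job_id_from_urn` and `job_view_url` are hypothetical helper names.

```python
def job_id_from_urn(urn):
    """Return the trailing numeric ID of a URN string, or None if there is none."""
    job_id = urn.rsplit(":", 1)[-1]
    return job_id if job_id.isdigit() else None


def job_view_url(job_id):
    """Build the public job page URL for a given job ID."""
    return f"https://www.linkedin.com/jobs/view/{job_id}/"


job_id = job_id_from_urn("urn:li:jobPosting:4208340911")
print(job_view_url(job_id))
```

Returning `None` for a malformed URN lets the caller skip that card instead of requesting an invalid URL.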

In [8]:
import requests
from bs4 import BeautifulSoup

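def get_seniority_level(job_id):
    """Fetch a job's public page and return its 'Seniority level' text.

    Returns 'Not Specified' or 'Not Found' when the field is missing,
    and 'Error' when the request or parsing fails.
    """
    url = f"https://www.linkedin.com/jobs/view/{job_id}/"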
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')
        
        criteria_items = soup.find_all('li', class_='description__job-criteria-item')
        
        for item in criteria_items:
            header = item.find('h3', class_='description__job-criteria-subheader')
            if header and 'Seniority level' in header.text:
                value = item.find('span', class_='description__job-criteria-text')
                return value.get_text(strip=True) if value else 'Not Specified'
        
        return 'Not Found'
    
    except Exception as e:
        print(f"Error: {str(e)}")
        return 'Error'


In [86]:
from tqdm import tqdm
tqdm.pandas()
df['seniority_level'] = df['urn'].progress_apply(get_seniority_level)

 32%|███▏      | 66/208 [01:11<02:06,  1.12it/s]

Error: 429 Client Error: Request denied for url: https://www.linkedin.com/jobs/view/4218607127/


100%|██████████| 208/208 [03:52<00:00,  1.12s/it]
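
The 429 error in the run above suggests the server is rate limiting the requests. One common mitigation is to retry with exponential backoff, sketched below under stated assumptions: `call_with_backoff` is a hypothetical helper, and `fetch` is any callable that takes a URL and returns a `(status_code, body)` tuple. The delays shown are illustrative, not values known to satisfy LinkedIn.

```python
import time


def call_with_backoff(fetch, url, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff while it returns HTTP 429."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Wait 2s, 4s, 8s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    # Retries exhausted: hand the last 429 response back to the caller
    return status, body
```

With `requests`, `fetch` could wrap `requests.get(url, headers=headers, timeout=10)` and return `(response.status_code, response.text)`. The `sleep` argument is injectable so the helper can be tested without real waiting.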


In [87]:
df

| | urn | title | ... | link | seniority_level |
|---|---|---|---|---|---|
| 0 | 4201663959 | Data Scientist (Global Ranking) | ... | https://de.linkedin.com/jobs/view/data-scienti... | Mid-Senior level |
| 1 | 4217454324 | Junior Data Scientist (f/m/d) | ... | https://de.linkedin.com/jobs/view/junior-data-... | Entry level |
| 2 | 4217259778 | Machine Learning Engineer (f/m/d) | ... | https://de.linkedin.com/jobs/view/machine-lear... | Mid-Senior level |
| 3 | 4180273755 | Data Scientist (Global Ranking) | ... | https://de.linkedin.com/jobs/view/data-scienti... | Mid-Senior level |
| 4 | 4199115609 | Data Scientist | ... | https://de.linkedin.com/jobs/view/data-scienti... | Mid-Senior level |
| ... | ... | ... | ... | ... | ... |
| 203 | 4214738941 | Full Stack Developer | ... | https://de.linkedin.com/jobs/view/full-stack-d... | Mid-Senior level |
| 204 | 4215630634 | Projektmitarbeiter / Doktorand (w/m/d) im Bere... | ... | https://de.linkedin.com/jobs/view/projektmitar... | Internship |
| 205 | 4199451843 | Software Engineer (w/m/d) | ... | https://de.linkedin.com/jobs/view/software-eng... | Associate |
| 206 | 4218183663 | Software Engineer (m/w/d) | ... | https://de.linkedin.com/jobs/view/software-eng... | Entry level |


In [88]:
df.to_csv('jobs_with_seniority.csv', index=False)

In [10]:
def search_jobs(job_title: str, country: str, days_ago: int, amount: int) -> pd.DataFrame:
    """
    Search LinkedIn jobs by title, country, and posting age, then add each job's seniority level.

    Parameters:
    job_title (str): Job title to search for (e.g., "Data Scientist")
    country (str): Country to search in (supported: Germany, France, UK, USA)
    days_ago (int): Maximum age of job postings in days
    amount (int): Maximum number of jobs to collect
    
    Returns:
    pd.DataFrame: Dataset containing job listings
    """
    # Dictionary of geoIds for countries
    country_geo_ids = {
        'Germany': '101282230',
        'France': '105015875',
        'UK': '101165590',
        'USA': '103644278'
    }
    
    # Encode the job title
    encoded_title = urllib.parse.quote(job_title)
    
    # Get geoId
    geo_id = country_geo_ids.get(country, '')
    if not geo_id:
        raise ValueError(f"Unknown country: {country}. Available: {list(country_geo_ids.keys())}")
    
    # Convert days to seconds
    seconds_ago = days_ago * 86400
    
    # Build search parameters string
    search_query = (
        f"{encoded_title}"
        f"&location={country}"
        f"&geoId={geo_id}"
        f"&f_TPR=r{seconds_ago}"
    )
    
    # Call the main scraping function
    df = scrape_linkedin_job_search(
        search_query,
        amount=amount
    )
    
    # Add progress bar for seniority level extraction
    tqdm.pandas()
    df['seniority_level'] = df['urn'].progress_apply(get_seniority_level)

    return df

# Example usage
df = search_jobs(
    job_title="Data Analyst", 
    country="Germany", 
    days_ago=5,
    amount=20
)

df

 ▸ Page 1: Added 20 jobs | Total: 20

Final result: 20 jobs


100%|██████████| 20/20 [00:19<00:00,  1.03it/s]


| | urn | title | company | location | link | seniority_level |
|---|---|---|---|---|---|---|
| 0 | 4179891938 | Quantitative Analyst | CGS | Berlin, Germany | https://de.linkedin.com/jobs/view/quantitative... | Entry level |
| 1 | 4212356107 | Data Analyst (m/w/d) | NOZ Digital | Osnabrück, Lower Saxony, Germany | https://de.linkedin.com/jobs/view/data-analyst... | Entry level |
| 2 | 4214254291 | Business Intelligence Analyst | developrec | Frankfurt Rhine-Main Metropolitan Area | https://de.linkedin.com/jobs/view/business-int... | Mid-Senior level |
| 3 | 4157932614 | Analytics and Insight Professional, Amazon Adv... | Amazon | Munich, Bavaria, Germany | https://de.linkedin.com/jobs/view/analytics-an... | Not Applicable |
| 4 | 4214199129 | Data Analyst (m/w/d) Agree21 | Harvey Nash | Stuttgart, Baden-Württemberg, Germany | https://de.linkedin.com/jobs/view/data-analyst... | Associate |
| 5 | 4214257057 | Business Intelligence Analyst | developrec | Stuttgart, Baden-Württemberg, Germany | https://de.linkedin.com/jobs/view/business-int... | Mid-Senior level |
| 6 | 4214964813 | Industry Insights Analyst - Marktforschung | NielsenIQ | Frankfurt, Hesse, Germany | https://de.linkedin.com/jobs/view/industry-ins... | Mid-Senior level |
| 7 | 4173642803 | Data Analyst* | Alnatura | Hesse, Germany | https://de.linkedin.com/jobs/view/data-analyst... | Associate |
| 8 | 4215101436 | People Analytics Specialist (m/w/d) | Inditex | Hamburg, Hamburg, Germany | https://de.linkedin.com/jobs/view/people-analy... | Entry level |
| 9 | 4214784630 | Data Analyst (Gaming) (f/m/d) | Sunday | Hamburg, Hamburg, Germany | https://de.linkedin.com/jobs/view/data-analyst... | Mid-Senior level |
