# 🕸️ **LinkedIn Real-time Data Scraper Example** 🕸️

This notebook scrapes real-time job listings from LinkedIn while keeping the search narrow to avoid triggering anti-bot measures. The goal is to collect approximately 1,000 relevant job postings that meet specific criteria. Keeping the scope small reduces the risk of being blocked while keeping the data fresh and aligned with current job market trends. The scraper is built to handle dynamic content, and the search parameters can be adjusted for efficient and safe data extraction.


## 1. Package Install

In [1]:
!pip install requests beautifulsoup4 selenium webdriver-manager


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## 2. Generate URL:

### currentJobId=4022915975
**Explanation**: The unique ID of the job currently being viewed. The job with this ID is highlighted in the results.
**Purpose**: LinkedIn uses this parameter to identify which job posting the user is actively viewing.

### distance=25
**Explanation**: Defines the search radius around the job location. In this case, it is set to 25 miles.
**Purpose**: This parameter filters the job search results by geographical distance from the selected location.

### origin=JOBS_HOME_KEYWORD_HISTORY
**Explanation**: Describes how the user arrived at the job search page. `JOBS_HOME_KEYWORD_HISTORY` indicates that the user reached this search through the job search history on LinkedIn's homepage.
**Purpose**: This parameter helps LinkedIn track the user's path to the current search and provides insight into user behavior.

### refresh=true
**Explanation**: Indicates whether the page should refresh. `true` specifies that the search results should be refreshed.
**Purpose**: LinkedIn uses this parameter to request updated search results.

### keywords=Data%20Engineer
**Explanation**: Specifies the search term for job roles. In this case, it searches for jobs matching "Data Engineer". `%20` is the percent-encoded representation of a space character.
**Purpose**: This parameter filters jobs by keywords relevant to job titles or descriptions.
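
To make the parameters above concrete, the following sketch splits a search URL into its query parameters using only the standard library. The `sample_url` is assembled here from the values described above purely for illustration, and `describe_params` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative URL assembled from the parameter values described above.
sample_url = (
    "https://www.linkedin.com/jobs/search/"
    "?currentJobId=4022915975&distance=25"
    "&keywords=Data%20Engineer&origin=JOBS_HOME_KEYWORD_HISTORY&refresh=true"
)

def describe_params(url):
    """Return a dict mapping each query parameter to its decoded value."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key appears once here, so take the first.
    return {key: values[0] for key, values in parse_qs(query).items()}

params = describe_params(sample_url)
for name, value in params.items():
    print(f"{name} = {value}")
```

Note that `parse_qs` decodes `%20` back to a space, so `keywords` comes out as `Data Engineer`.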

**Boolean search terms** (source: https://dripify.io/boolean-search-on-linkedin/)

The main Boolean search terms are:

- **AND**: Combines two or more keywords and returns results that contain all of them. For example, `sales AND marketing` returns results that contain both "sales" and "marketing".
- **OR**: Combines two or more keywords and returns results that contain at least one of them. For example, `sales OR marketing` returns results that contain either "sales" or "marketing", or both.
- **NOT**: Excludes certain keywords from the search results. For example, `sales NOT marketing` returns results that contain "sales" but not "marketing".
- **" "** (quotation marks): Searches for an exact phrase. For example, `"sales manager"` returns results that contain the exact phrase "sales manager".
- **( )** (parentheses): Groups keywords so that an operator applies to a specific set of them. For example, `(sales OR marketing) NOT (manager OR director)` returns results that contain either "sales" or "marketing" but not "manager" or "director".
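
A Boolean query contains spaces, quotation marks, and parentheses, so it has to be percent-encoded before it can be used as the `keywords` value. The sketch below shows one way to do that with the standard library; `encode_keywords` and the sample query are hypothetical, and how LinkedIn interprets a given Boolean expression in the job search is not verified here.

```python
from urllib.parse import quote, unquote

def encode_keywords(query):
    """Percent-encode a search string for use as the `keywords` URL value."""
    # safe="" encodes every reserved character, including "/", quotes, and parentheses.
    return quote(query, safe="")

boolean_query = '("data engineer" OR "analytics engineer") NOT manager'
encoded = encode_keywords(boolean_query)
print(encoded)

# Decoding the result gives back the original query unchanged.
print(unquote(encoded) == boolean_query)
```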


In [2]:
from urllib.parse import quote

def get_title(prompt, default_keywords="computer science"):
    """
    Prompt the user for a job title and return it percent-encoded for use in a URL.
    If the user input is blank, the default value is used.
    """
    user_input = input(f"{prompt} (Default: {default_keywords}): ").strip()
    # Encode spaces and other special characters (e.g. " " -> "%20")
    return quote(user_input or default_keywords)

keywords = get_title("Please enter your preferred job title")

### geoId=118424786
**Explanation**: A unique geographical identifier. Each location (city, region, or country) has a distinct `geoId` in LinkedIn's system.
**Purpose**: LinkedIn uses this parameter to specify the location for the job search. In this case, `118424786` refers to a specific geographic region.

| **Index** | **geoId** | **Location** |
|-----------|-----------|--------------|
| 0 | 101174742 | Canada |
| 1 | 105149290 | Ontario, Canada |
| 2 | 100025096 | Toronto, Ontario, Canada |
| 3 | 101788145 | Mississauga, Ontario, Canada |


In [3]:
geoIds = [101174742, 105149290, 100025096, 101788145]
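
The bare list above loses the location names from the table. As a small sketch, a dictionary keeps the two together and can produce the `geoId` URL parameter; `GEO_IDS` and `geo_param` are hypothetical names, and the IDs are copied from the table above.

```python
# geoId -> location name, copied from the table above.
GEO_IDS = {
    101174742: "Canada",
    105149290: "Ontario, Canada",
    100025096: "Toronto, Ontario, Canada",
    101788145: "Mississauga, Ontario, Canada",
}

def geo_param(geo_id):
    """Return the 'geoId=...' URL parameter for a known location ID."""
    if geo_id not in GEO_IDS:
        raise ValueError(f"Unknown geoId: {geo_id}")
    return f"geoId={geo_id}"

for geo_id, location in GEO_IDS.items():
    print(f"{location}: {geo_param(geo_id)}")
```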

### f_E=1%2C2%2C3%2C4%2C5%2C6
**Explanation**: This parameter represents the experience levels in LinkedIn's job search system. Each number corresponds to a specific experience level, allowing users to filter job searches by their desired career stage.
**Purpose**: LinkedIn uses this parameter to filter job search results by experience level. If all experience levels are selected, the parameter appears as `f_E=1%2C2%2C3%2C4%2C5%2C6` in the URL. If no experience level is selected, `f_E` is not included in the URL.

| **Index** | **f_E Value** | **Experience Level**    |
|-----------|---------------|------------------------|
| 0         | 1             | Internship             |
| 1         | 2             | Entry Level            |
| 2         | 3             | Associate              |
| 3         | 4             | Mid-Senior Level       |
| 4         | 5             | Director               |
| 5         | 6             | Executive              |


In [4]:
def get_experience_level_url():
    """
    Asks the user to input their desired experience levels and constructs the URL parameter for 'f_E'.
    If no experience levels are selected, it will not include 'f_E' in the URL.
    """
    experience_levels = {
        "1": "Internship",
        "2": "Entry level",
        "3": "Associate",
        "4": "Mid-Senior level",
        "5": "Director",
        "6": "Executive"
    }
    
    print("Select your experience levels (separate multiple choices with commas, e.g., '1,3,4'):")
    for key, value in experience_levels.items():
        print(f"{key}: {value}")
    
    user_input = input("Your selection (leave blank for no filter): ").strip()
    
    if user_input == "":
        return ""  # No 'f_E' in the URL if the user makes no selection
    
    selected_levels = [level.strip() for level in user_input.split(',') if level.strip() in experience_levels]
    
    if selected_levels:
        return f"f_E={'%2C'.join(selected_levels)}"
    else:
        return ""
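
Because the function above calls `input()`, it is awkward to test. One possible design, sketched here under the assumption that the same validation rules apply, is to move the parsing into a pure function that takes the raw selection string; `experience_param` is a hypothetical name and, unlike the original, it also drops duplicate choices.

```python
VALID_LEVELS = ("1", "2", "3", "4", "5", "6")

def experience_param(selection, valid_levels=VALID_LEVELS):
    """Turn a comma-separated selection such as '1, 3,4' into an 'f_E=...' parameter."""
    chosen = []
    for item in selection.split(","):
        item = item.strip()
        # Keep only known levels, and keep each level once.
        if item in valid_levels and item not in chosen:
            chosen.append(item)
    # %2C is the percent-encoded comma that separates the values in the URL.
    return f"f_E={'%2C'.join(chosen)}" if chosen else ""

print(experience_param("1, 3,4"))
print(repr(experience_param("")))
```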

### f_WT=1%2C3%2C2
**Explanation**: Represents the work type filter in LinkedIn's job search. Each number corresponds to a specific work type, allowing users to filter job searches based on their preferred work arrangement.
**Purpose**: LinkedIn uses this parameter to filter job search results by work type. If multiple work types are selected, they will be represented as `f_WT=1%2C3%2C2` in the URL. If no work type is selected, `f_WT` is not included in the URL.

| **Index** | **f_WT Value** | **Work Type** |
|-----------|----------------|--------------|
| 0         | 1              | On-site      |
| 1         | 2              | Hybrid       |
| 2         | 3              | Remote       |


In [5]:
def get_work_type_url():
    """
    Asks the user to input their preferred work types and constructs the URL parameter for 'f_WT'.
    If no work types are selected, it will not include 'f_WT' in the URL.
    """
    work_types = {
        "1": "On-site",
        "2": "Hybrid",
        "3": "Remote"
    }
    
    print("Select your work types (separate multiple choices with commas, e.g., '1,2,3'):")
    for key, value in work_types.items():
        print(f"{key}: {value}")
    
    user_input = input("Your selection (leave blank for no filter): ").strip()
    
    if user_input == "":
        return ""  # No 'f_WT' in the URL if the user makes no selection
    
    selected_types = [type_.strip() for type_ in user_input.split(',') if type_.strip() in work_types]
    print(selected_types)
    if selected_types:
        return f"f_WT={'%2C'.join(selected_types)}"
    else:
        return ""


Select your work types (separate multiple choices with commas, e.g., '1,2,3'):
1: On-site
2: Hybrid
3: Remote
URL without work type filter: https://www.linkedin.com/jobs/search


### f_TPR=r2592000, f_TPR=r604800, f_TPR=r86400
**Explanation**: The `f_TPR` parameter is used to filter job postings based on the time they were posted. Each value represents a different time range.
**Purpose**: LinkedIn uses this parameter to filter job search results by how recently they were posted.

| **Index** | **f_TPR Value** | **Time Range**    |
|-----------|------------------|-------------------|
| 1         | r86400           | Past 24 hours     |
| 2         | r604800          | Past week         |
| 3         | r2592000         | Past month        |
| 4         | (None)           | Any time          |

- If the user selects "Any time," the `f_TPR` parameter is not included in the URL.
- The URL would look like: `https://www.linkedin.com/jobs/search?f_TPR=r2592000` if "Past month" is selected.
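
The three values in the table equal 1, 7, and 30 days expressed in seconds, prefixed with `r`. The sketch below reproduces them from a day count; `f_tpr_value` is a hypothetical helper, and whether LinkedIn accepts values other than the three listed is an assumption not verified here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def f_tpr_value(days):
    """Return the f_TPR value for a window of `days` days, e.g. 7 -> 'r604800'."""
    return f"r{days * SECONDS_PER_DAY}"

for label, days in [("Past 24 hours", 1), ("Past week", 7), ("Past month", 30)]:
    print(f"{label}: f_TPR={f_tpr_value(days)}")
```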


In [6]:
def get_time_posted_filter_url():
    """
    Asks the user to input their preferred time range for job postings and constructs the URL parameter for 'f_TPR'.
    If no time range is selected, it will not include 'f_TPR' in the URL.
    """
    time_ranges = {
        "1": ("r86400", "Past 24 hours"),
        "2": ("r604800", "Past week"),
        "3": ("r2592000", "Past month"),
        "4": ("", "Any time")
    }
    
    print("Select your preferred time range for job postings (choose one):")
    for key, value in time_ranges.items():
        print(f"{key}: {value[1]}")
    
    user_input = input("Your selection (leave blank for 'Any time'): ").strip()
    
    if user_input == "" or user_input not in time_ranges:
        return ""  # No 'f_TPR' in the URL if the user makes no selection or selects "Any time"
    
    selected_time_range = time_ranges[user_input][0]
    
    if selected_time_range:
        return f"f_TPR={selected_time_range}"
    else:
        return ""

### f_SB2=21 - 29
**Explanation**: The `f_SB2` parameter is used to filter job postings based on the salary range. Each value corresponds to a specific salary range, allowing users to filter jobs by their preferred salary.

| **Index** | **f_SB2 Value** | **Salary Range** |
|-----------|-----------------|------------------|
| 1         | 21              | $40,000+         |
| 2         | 22              | $60,000+         |
| 3         | 23              | $80,000+         |
| 4         | 24              | $100,000+        |
| 5         | 25              | $120,000+        |
| 6         | 26              | $140,000+        |
| 7         | 27              | $160,000+        |
| 8         | 28              | $180,000+        |
| 9         | 29              | $200,000+        |

- If no salary range is selected, `f_SB2` will not be included in the URL.
- The URL would look like: `https://www.linkedin.com/jobs/search?f_SB2=24` if "$100,000+" is selected.


In [7]:
def get_salary_filter_url():
    """
    Asks the user to input their preferred salary range for job postings and constructs the URL parameter for 'f_SB2'.
    If no salary range is selected, it will not include 'f_SB2' in the URL.
    """
    salary_ranges = {
        "1": ("21", "$40,000+"),
        "2": ("22", "$60,000+"),
        "3": ("23", "$80,000+"),
        "4": ("24", "$100,000+"),
        "5": ("25", "$120,000+"),
        "6": ("26", "$140,000+"),
        "7": ("27", "$160,000+"),
        "8": ("28", "$180,000+"),
        "9": ("29", "$200,000+")
    }
    
    print("Select your preferred salary range (choose one):")
    for key, value in salary_ranges.items():
        print(f"{key}: {value[1]}")
    
    user_input = input("Your selection (leave blank for no filter): ").strip()
    
    if user_input == "" or user_input not in salary_ranges:
        return ""  # No 'f_SB2' in the URL if the user makes no selection
    
    selected_salary_range = salary_ranges[user_input][0]
    
    return f"f_SB2={selected_salary_range}"

### f_JT=F%2CP%2CT
**Explanation**: The `f_JT` parameter is used to filter job postings based on the job type. Each value corresponds to a specific job type, allowing users to filter jobs by their preferred job arrangement.
**Purpose**: LinkedIn uses this parameter to filter job search results by job type. Multiple job types can be selected and represented in the URL using `f_JT`.

| **Index** | **f_JT Value** | **Job Type**   |
|-----------|----------------|----------------|
| 1         | F              | Full-time      |
| 2         | P              | Part-time      |
| 3         | C              | Contract       |
| 4         | T              | Temporary      |
| 5         | V              | Volunteer      |
| 6         | I              | Internship     |
| 7         | O              | Other          |

- If no job type is selected, `f_JT` will not be included in the URL.
- For example, selecting "Full-time" and "Part-time" results in: `https://www.linkedin.com/jobs/search?f_JT=F%2CP`.


In [8]:
def get_job_type_filter_url():
    """
    Asks the user to input their preferred job types (multiple selections allowed) and constructs the URL parameter for 'f_JT'.
    If no job type is selected, it will not include 'f_JT' in the URL.
    """
    job_types = {
        "1": ("F", "Full-time"),
        "2": ("P", "Part-time"),
        "3": ("C", "Contract"),
        "4": ("T", "Temporary"),
        "5": ("V", "Volunteer"),
        "6": ("I", "Internship"),
        "7": ("O", "Other")
    }
    
    print("Select your preferred job types (you can select one or multiple by separating choices with commas, e.g., '1,2,3'):")
    for key, value in job_types.items():
        print(f"{key}: {value[1]}")
    
    user_input = input("Your selection (leave blank for no filter): ").strip()
    
    if user_input == "":
        return ""  # No 'f_JT' in the URL if the user makes no selection
    
    selected_types = [job_types[choice.strip()][0] for choice in user_input.split(',') if choice.strip() in job_types]
    
    if selected_types:
        return f"f_JT={'%2C'.join(selected_types)}"
    else:
        return ""

# Assumes the following functions have already been defined:
# get_experience_level_url(), get_work_type_url(), get_time_posted_filter_url(),
# get_salary_filter_url(), get_job_type_filter_url()

def build_linkedin_job_search_url_interactive():
    """
    Uses previously defined functions to obtain each filter value, combines them,
    and builds the final LinkedIn job search URL.
    
    Returns:
    - str: The final LinkedIn job search URL.
    """
    base_url = "https://www.linkedin.com/jobs/search"
    
    # Call previously defined functions to get each parameter
    f_E = get_experience_level_url()   # Example function for experience levels
    f_WT = get_work_type_url()         # Example function for work types
    f_TPR = get_time_posted_filter_url()  # Example function for time posted
    f_SB2 = get_salary_filter_url()    # Example function for salary range
    f_JT = get_job_type_filter_url()   # Example function for job types
    
    # Each helper already returns a complete "key=value" string (or "" for no filter),
    # so the key must not be prepended again.
    filters = [f_E, f_WT, f_TPR, f_SB2, f_JT]
    
    # Construct the URL, skipping filters the user left blank
    query_params = [param for param in filters if param]
    
    final_url = base_url
    if query_params:
        final_url += '?' + '&'.join(query_params)
    
    return final_url

print("Final LinkedIn Job Search URL:", build_linkedin_job_search_url_interactive())
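
For scripted runs it can be convenient to build the URL without prompts. The following is a minimal non-interactive sketch that takes already-validated values and lets `urllib.parse.urlencode` handle the escaping; `build_search_url` and its argument names are hypothetical, and the parameter keys are the ones described in this section.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/jobs/search"

def build_search_url(keywords=None, geo_id=None, experience=(), work_types=(),
                     time_posted=None, salary=None, job_types=()):
    """Assemble a job search URL from already-validated filter values."""
    params = {}
    if keywords:
        params["keywords"] = keywords
    if geo_id:
        params["geoId"] = geo_id
    if experience:
        params["f_E"] = ",".join(str(level) for level in experience)
    if work_types:
        params["f_WT"] = ",".join(str(kind) for kind in work_types)
    if time_posted:
        params["f_TPR"] = time_posted
    if salary:
        params["f_SB2"] = salary
    if job_types:
        params["f_JT"] = ",".join(job_types)
    # quote_via=quote encodes spaces as %20 (not "+") and commas as %2C.
    query = urlencode(params, quote_via=quote)
    return f"{BASE_URL}?{query}" if query else BASE_URL

url = build_search_url(
    keywords="Data Engineer",
    geo_id=100025096,
    experience=[2, 3],
    time_posted="r604800",
    job_types=["F", "C"],
)
print(url)
```

Filters that are left at their defaults are simply omitted from the query string, matching the behavior described for each parameter above.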


## 3. Scrape Example:

In [6]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

In [30]:
# Configure Chrome browser options
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-blink-features=AutomationControlled")

# Initialize ChromeDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Open the LinkedIn login page
driver.get("https://www.linkedin.com/login")

# Click the login button
# NOTE: no credentials are entered before this click, so the form is submitted empty.
# Fill in the login form first (manually in the browser window, or with send_keys).
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# Wait for the login to complete and redirect to the home page
time.sleep(5)

# Get the cookies after a successful login
cookies = driver.get_cookies()

# Print and save the cookies
for cookie in cookies:
    print(f"{cookie['name']} = {cookie['value']}")

# If the cookies need to be persisted, store them in a file or a database
# For example, write them to a file:
import json
with open('cookies.json', 'w') as f:
    json.dump(cookies, f)

# The saved cookies can be reused for subsequent operations
# Open the LinkedIn jobs page
driver.get("https://www.linkedin.com/jobs/search?keywords=Data%20Engineer&location=Worldwide")

time.sleep(5)
print(driver.page_source)

# Close the browser
# driver.quit()


<html lang="en"><head>
        <meta name="pageKey" content="d_homepage-guest-home">
<!----><!---->        <meta name="locale" content="en_US">
        <meta id="config" data-app-version="2.1.1682" data-call-tree-id="AAYijD0TxJA3N1DBOr3bYA==" data-jet-tags="guest-homepage" data-multiproduct-name="homepage-guest-frontend" data-service-name="homepage-guest-frontend" data-browser-id="a9d80b4f-a174-4c1f-8941-e5136a363a3d" data-enable-page-view-heartbeat-tracking="" data-page-instance="urn:li:page:d_homepage-guest-home;H7KYTyKzTyC4lYVgcKK+wA==" data-disable-jsbeacon-pagekey-suffix="false" data-member-id="0" data-dna-member-lix-treatment="control" data-human-member-lix-treatment="control" data-dfp-member-lix-treatment="control">

        <link rel="canonical" href="https://www.linkedin.com/">
          <link rel="alternate" hreflang="de" href="https://de.linkedin.com/">
          <link rel="alternate" hreflang="en-IE" href="https://ie.linkedin.com/">
          <link rel="alternate" hreflang=