# Scrape LinkedIn Using Selenium, Requests and Beautiful Soup in Python

We are going to scrape LinkedIn job postings. More specifically, the following details will be scraped:
- Job ID
- Job title
- Seniority level
- Location
- Job description
- Number of applicants
- Posted time ago

1. To scrape job IDs, we will use `selenium` to navigate to this URL: `https://www.linkedin.com/jobs/search?`.

   The `chromedriver` executable and your LinkedIn credentials are required here.

2. As explained [here](https://www.scrapingdog.com/blog/scrape-linkedin-jobs/), to scrape the other details (level, description...), we will send a simple GET request (using the `requests` library) to this URL: `https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/xxxxxx`, where xxxxxx is the job ID.\
   This is easier than simulating clicks with `selenium`.

Note that I tried to scrape Job IDs using the guest URL `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?` but the results were imprecise.

# I-Scraping LinkedIn Job IDs using selenium and BeautifulSoup

In [1]:
# pip install selenium 
# pip install beautifulsoup4

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import requests

import time, datetime
import pandas as pd
import numpy as np
import math, re, sys
import warnings
warnings.filterwarnings("ignore")

## Log in to LinkedIn using selenium

🔑 **Note:**

1. To use `selenium`, we need a web driver. For instance, `chromedriver` can be downloaded from [here](https://chromedriver.chromium.org/downloads).\
   Next, we add the chromedriver executable to the project directory.

2. LinkedIn credentials (email address and password) are also required. You can save them in a text file (the next cell reads `user_credentials.txt`), with the email address on the first line and the password on the second.
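
The credentials file is read as two lines: email address first, password second. A minimal sketch of that parsing as a reusable function is shown here; `read_credentials` is a hypothetical helper name that does not appear in the notebook, and it deliberately avoids printing the values.

```python
from pathlib import Path


def read_credentials(path):
    """Return (email, password) from a two-line text file.

    Hypothetical helper: assumes line 1 holds the email address and
    line 2 the password, as in the notebook's credentials file.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Ignore blank lines and surrounding whitespace.
    values = [line.strip() for line in lines if line.strip()]
    if len(values) < 2:
        raise ValueError("expected an email line and a password line")
    return values[0], values[1]
```

Raising an error on a short file makes a missing password visible immediately instead of failing later at the login step.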

In [2]:
# Get User Credentials
with open('user_credentials.txt', 'r',encoding="utf-8") as file:
    user_credentials = file.readlines()
    user_credentials = [line.rstrip() for line in user_credentials]
    
my_email,my_pwd = user_credentials[0],user_credentials[1]
my_email,my_pwd

('bt4222project@gmail.com', 'bt4222project')

In [3]:
# 1. Instantiate the chrome service
chromedriver_path = "C:/Users/Jason/Desktop/IS3107/chromedriver.exe"
service = Service(executable_path=chromedriver_path)

# 2. Instantiate the webdriver
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options, service=service)

# 3. Open the LinkedIn login page
driver.get('https://www.linkedin.com/login')
time.sleep(5) # waiting for the page to load

# 4. Enter email address & password
email_input = driver.find_element(By.ID, 'username')
password_input = driver.find_element(By.ID, 'password')
email_input.send_keys(my_email)
password_input.send_keys(my_pwd)

# 5. Click the login button
password_input.send_keys(Keys.ENTER)

time.sleep(10)

We will be logged into LinkedIn after running the above code.

## Scraping LinkedIn Job IDs

1. Set the search query parameters: keywords (i.e. the job title) and location;
2. Search results are displayed on many pages: `25` jobs are listed on each page;
3. We will navigate to every page using the `start` parameter (0,25,50...);
4. We need to scroll to the bottom of the page to load the full data;
5. To get Job Ids, we will parse the HTML content of the page using BeautifulSoup.
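
The pagination in steps 2 and 3 can be sketched with the standard library alone. This is only an illustration: `page_urls` is a hypothetical helper, and it assumes the search URL and the `keywords`, `location` and `start` parameters used in this notebook.

```python
import math
from urllib.parse import quote, urlencode

SEARCH_URL = "https://www.linkedin.com/jobs/search/"
PAGE_SIZE = 25  # jobs listed per results page, as stated above


def page_urls(keywords, location, number_of_jobs, page_size=PAGE_SIZE):
    """Build one search URL per results page (hypothetical helper)."""
    number_of_pages = math.ceil(number_of_jobs / page_size)
    urls = []
    for page_num in range(number_of_pages):
        params = {
            "keywords": keywords,
            "location": location,
            "start": page_num * page_size,  # 0, 25, 50, ...
        }
        # quote_via=quote encodes spaces as %20 rather than '+'
        urls.append(f"{SEARCH_URL}?{urlencode(params, quote_via=quote)}")
    return urls
```

With 352 results, `page_urls("data scientist", "singapore", 352)` yields 15 URLs whose `start` values run from 0 to 350.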

In [4]:
List_Job_IDs = []

In [5]:
# Create a 'scroll to the bottom' function.

# time.sleep() gives the webpage extra time to load.
# I used 120 seconds. If the 25 jobs have not loaded within this period, adjust the value and test again.

def scroll_to_bottom(driver,sleep_time=120):
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
    
    time.sleep(sleep_time)  
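
In the function above, the height check runs immediately after each scroll and the pause happens once, after the loop. A possible variant pauses inside the loop instead and stops once the height is stable. It is a sketch only: it assumes that newly loaded results increase `document.body.scrollHeight` shortly after each scroll, the name `scroll_until_stable` and the `pause` and `max_rounds` parameters are hypothetical, and it has not been tested against LinkedIn.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    Hypothetical variant of scroll_to_bottom: `driver` only needs an
    execute_script method, like a Selenium WebDriver.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height unchanged: nothing more was loaded
        last_height = new_height
    return max_rounds  # safety cap so the loop always terminates
```

The `max_rounds` cap guards against pages that keep growing indefinitely.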

In [6]:
# Navigate to the first page (start=0) and scroll to the bottom of the page

keywords = 'data%20scientist'
location = 'singapore'
start = 0

url = f'https://www.linkedin.com/jobs/search/?keywords={keywords}&location={location}&start={start}'
url = requests.utils.requote_uri(url)
driver.get(url)
scroll_to_bottom(driver,sleep_time=120)

In [7]:
# Get number of jobs found and number of pages:

# Parse the HTML content of the page using BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'html.parser')

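try:
    div_number_of_jobs = soup.find("div",{"class":"jobs-search-results-list__subtitle"})
    # take the first token of the subtitle text; drop thousands separators before converting
    number_of_jobs = int(div_number_of_jobs.find('span').get_text().strip().split()[0].replace(",",""))
except (AttributeError, IndexError, ValueError):
    # subtitle not found or not in the expected format
    number_of_jobs = 0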
    
number_of_pages=math.ceil(number_of_jobs/25)
print("number_of_jobs:",number_of_jobs)
print("number_of_pages:",number_of_pages)

number_of_jobs: 352
number_of_pages: 15


In [8]:
# Get Job Ids present on the first page.

def find_Job_Ids(soup):

    Job_Ids_on_the_page = []
    
    job_postings = soup.find_all('li', {'class': 'jobs-search-results__list-item'})
    for job_posting in job_postings:
        Job_ID = job_posting.get('data-occludable-job-id')
        Job_Ids_on_the_page.append(Job_ID)
        # job_title = job_posting.find('a', class_='job-card-list__title').get_text().strip()
        # location = job_posting.find('li', class_='job-card-container__metadata-item').get_text().strip()
    
    return Job_Ids_on_the_page    

Jobs_on_this_page = find_Job_Ids(soup)
List_Job_IDs.extend(Jobs_on_this_page)

Now that we've scraped the job IDs and number of results from the first page, let's iterate over the remaining pages.

### Iterate over the remaining pages

In [9]:
if number_of_pages>1:
    
    for page_num in range(1,number_of_pages):
        print(f"Scraping page: {page_num}",end="...")
        
        # Navigate to page
        url = f'https://www.linkedin.com/jobs/search/?keywords={keywords}&location={location}&start={25 * page_num}'
        url = requests.utils.requote_uri(url)
        driver.get(url)
        scroll_to_bottom(driver)

        # Parse the HTML content of the page using BeautifulSoup.
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Get Job Ids present on the page.
        Jobs_on_this_page = find_Job_Ids(soup)
        List_Job_IDs.extend(Jobs_on_this_page)  
        print(f'Jobs found:{len(Jobs_on_this_page)}')

pd.DataFrame({"Job_Id":List_Job_IDs}).to_csv('Job_Ids.csv',index=False)

Scraping page: 1...Jobs found:25
Scraping page: 2...Jobs found:25
Scraping page: 3...Jobs found:25
Scraping page: 4...Jobs found:25
Scraping page: 5...Jobs found:25
Scraping page: 6...Jobs found:25
Scraping page: 7...Jobs found:25
Scraping page: 8...Jobs found:25
Scraping page: 9...Jobs found:25
Scraping page: 10...Jobs found:25
Scraping page: 11...Jobs found:25
Scraping page: 12...Jobs found:25
Scraping page: 13...Jobs found:25
Scraping page: 14...Jobs found:25
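
The log above reports 25 IDs for each of pages 1 to 14, that is 350 IDs on top of those collected from the first page, whereas 352 jobs were reported. The list may therefore contain repeats or missing values; this is an inference from the output, not something the notebook verifies. A minimal sketch for cleaning the list before saving it, with `unique_job_ids` as a hypothetical helper name:

```python
def unique_job_ids(job_ids):
    """Drop missing values and repeats while keeping first-seen order."""
    seen = set()
    unique = []
    for job_id in job_ids:
        if job_id is None:
            continue  # list item without a job ID attribute
        key = str(job_id)  # compare IDs as strings, e.g. 123 and "123" match
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```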


In [10]:
# Close the browser and shut down the chromedriver executable
# that was started with the webdriver.
driver.quit()

## Scraping Job description using requests and BeautifulSoup
Reference: https://www.scrapingdog.com/blog/scrape-linkedin-jobs/

In [11]:
import requests
from bs4 import BeautifulSoup

list_job_IDs = pd.read_csv("Job_Ids.csv").Job_Id.to_list()

In [12]:
list_job_IDs = list_job_IDs
list_job_IDs

[3885140325,
 3617503376,
 3616276879,
 3885132588,
 3547792602,
 3888353382,
 3902211450,
 3900653425,
 3902319179,
 3811541744,
 3716074825,
 3321618896,
 3858679337,
 3897727248,
 3883628491,
 3824809934,
 3893993655,
 3900667624,
 3888065503,
 3876418041,
 3827310674,
 3844348914,
 3846687869,
 3873756357,
 3838661122,
 3879816912,
 3776593342,
 3614427281,
 3804727263,
 3804234540,
 3866074146,
 3876413806,
 3888479980,
 3881929168,
 3684049712,
 3883454550,
 3872111318,
 3817509349,
 3901984356,
 3747251947,
 3884300091,
 3902890993,
 3726889696,
 3846687408,
 3890448969,
 3888589081,
 3877331901,
 3900500571,
 3881142244,
 3876060212,
 3628611242,
 3847969552,
 3631285002,
 3742413272,
 3747983415,
 3748160217,
 3864126544,
 3844339802,
 3902336096,
 3745188988,
 3896872612,
 3848121871,
 3870741744,
 3885124968,
 3898788642,
 3878971862,
 3902212356,
 3747026637,
 3888900324,
 3899676102,
 3871730910,
 3872871033,
 3890434503,
 3693536674,
 3868203682,
 3636350941,
 3681755416,

In [13]:
def remove_tags(html):
    '''remove html tags from BeautifulSoup.text'''
 
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
 
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
 
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)

In [14]:
job_url='https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{}'
job={}
list_jobs=[]

for j in range(0,len(list_job_IDs)):
    print(f"{j+1} ... read jobId:{list_job_IDs[j]}")

    resp = requests.get(job_url.format(list_job_IDs[j]))
    soup=BeautifulSoup(resp.text,'html.parser')
    # print(soup.prettify()) 

    job["Job_ID"] = list_job_IDs[j] 
    # try:
    #     job["Job_html"] = resp.content
    # except:
    #     job["Job_html"]=None

    try: # remove tags
        job["Job_txt"] = remove_tags(resp.content)
    except:
        job["Job_txt"] = None
    
    try:
        job["company"]=soup.find("div",{"class":"top-card-layout__card"}).find("a").find("img").get('alt')
    except:
        job["company"]=None

    try:
        job["job-title"]=soup.find("div",{"class":"top-card-layout__entity-info"}).find("a").text.strip()
    except:
        job["job-title"]=None

    try:
        job["level"]=soup.find("ul",{"class":"description__job-criteria-list"}).find("li").text.replace("Seniority level","").strip()
    except:
        job["level"]=None

    try:
        job["location"]=soup.find("span",{"class":"topcard__flavor topcard__flavor--bullet"}).text.strip()
    except:
        job["location"]=None

    try:
        job["posted-time-ago"]=soup.find("span",{"class":"posted-time-ago__text topcard__flavor--metadata"}).text.strip()
    except:
        job["posted-time-ago"]=None

    try:
        nb_candidats = soup.find("span",{"class":"num-applicants__caption topcard__flavor--metadata topcard__flavor--bullet"}).text.strip()
        nb_candidats = int(nb_candidats.split()[0])
        job["nb_candidats"]= nb_candidats
    except:
        job["nb_candidats"]=None

    list_jobs.append(job)
    job={}

# create a pandas DataFrame
jobs_DF = pd.DataFrame(list_jobs)

1 ... read jobId:3885140325
2 ... read jobId:3617503376
3 ... read jobId:3616276879
4 ... read jobId:3885132588
5 ... read jobId:3547792602
6 ... read jobId:3888353382
7 ... read jobId:3902211450
8 ... read jobId:3900653425
9 ... read jobId:3902319179
10 ... read jobId:3811541744
11 ... read jobId:3716074825
12 ... read jobId:3321618896
13 ... read jobId:3858679337
14 ... read jobId:3897727248
15 ... read jobId:3883628491
16 ... read jobId:3824809934
17 ... read jobId:3893993655
18 ... read jobId:3900667624
19 ... read jobId:3888065503
20 ... read jobId:3876418041
21 ... read jobId:3827310674
22 ... read jobId:3844348914
23 ... read jobId:3846687869
24 ... read jobId:3873756357
25 ... read jobId:3838661122
26 ... read jobId:3879816912
27 ... read jobId:3776593342
28 ... read jobId:3614427281
29 ... read jobId:3804727263
30 ... read jobId:3804234540
31 ... read jobId:3866074146
32 ... read jobId:3876413806
33 ... read jobId:3888479980
34 ... read jobId:3881929168
35 ... read jobId:36840

In [15]:
jobs_DF.head()

| | Job_ID | Job_txt | company | job-title | level | location | posted-time-ago | nb_candidats |
|---|---|---|---|---|---|---|---|---|
| 0 | 3885140325 | Data Scientist - Commodities Selby Jennings Si... | Selby Jennings | Data Scientist - Commodities | Mid-Senior level | Singapore, Singapore | 1 week ago | |
| 1 | 3617503376 | Data Scientist (Algo), Paid Ads TikTok Singapo... | TikTok | Data Scientist (Algo), Paid Ads | Not Applicable | Singapore | 1 month ago | |
| 2 | 3616276879 | | | | | | | |
| 3 | 3885132588 | Senior Data Scientist SixSense Singapore 1 wee... | SixSense | Senior Data Scientist | Mid-Senior level | Singapore | 1 week ago | |
| 4 | 3547792602 | | | | | | | |
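
Rows 2 and 4 above are empty. One possible cause, which the notebook does not confirm, is that some requests were rejected or throttled. A minimal retry sketch with exponential backoff is shown here under that assumption. `fetch_with_retry` and its `retries` and `backoff` parameters are hypothetical, and `fetch` is any zero-argument callable returning an object with a `status_code` attribute, such as a `requests.Response`.

```python
import time


def fetch_with_retry(fetch, retries=3, backoff=2.0):
    """Call `fetch()` until it returns HTTP 200 or the retries run out.

    Hypothetical helper. Returns the successful response, or None if
    no attempt succeeded.
    """
    for attempt in range(retries):
        try:
            response = fetch()
            if response.status_code == 200:
                return response
        except OSError:
            # network errors; requests' exceptions derive from OSError
            pass
        if attempt < retries - 1:
            time.sleep(backoff * (2 ** attempt))  # wait longer each time
    return None
```

In the loop above, it could be called as `fetch_with_retry(lambda: requests.get(url, timeout=10))`, where `url` is the formatted job posting URL, with a `None` result treated as a failed job.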


🔑 **Note:**

Now we have scraped all the LinkedIn job details.

The next step is to process the data:
1. Create a `posted_date` column from `posted-time-ago`;
2. Clean up `Job_txt`, the job description (remove sentences like "Remove photo First name Last name Email Password (8+ characters)");
3. Clean up the `level` column.

## Process data

In [16]:
def clean_Job_description(text):
    """Remove LinkedIn sign-in boilerplate sentences from the scraped job text."""
    if not isinstance(text, str):
        return text # nothing to clean (e.g. None when scraping failed)

    sentences_to_remove = ["Remove photo First name Last name Email Password (8+ characters) ",
                           "By clicking Agree & Join",
                           "you agree to the LinkedIn User Agreement",
                           "Privacy Policy and Cookie Policy",
                           "Continue Agree & Join or Apply on company website",
                           "Security verification",
                           "Close Already on LinkedIn ?",
                           "Close Already on LinkedIn?",
                           "Sign in Save Save job Save this job with your existing LinkedIn profile , or create a new one",
                           "Sign in Save Save job Save this job with your existing LinkedIn profile, or create a new one",
                           "Your job seeking activity is only visible to you",
                           "Email Continue Welcome back"]
    for sentence in sentences_to_remove:
        text = text.replace(sentence, "") # remove every occurrence of the sentence

    return text 

In [17]:
def get_posted_date(posted_time_ago,date_scraping):
    """Estimate the posting date from a relative string such as '1 week ago'.
    The offset is approximated in days: 1 week = 7, 1 month = 30, 1 year = 365.
    Returns None when the string is missing or uses another unit."""
    posted_date = None
    
    try:
        details = posted_time_ago.split()
        N_DAYS_AGO = int(details[0])
        day_week_month_year = details[1] 
        if day_week_month_year.startswith("day"):
            pass # already expressed in days
        elif day_week_month_year.startswith("week"):
            N_DAYS_AGO = N_DAYS_AGO*7
        elif day_week_month_year.startswith("month"):
            N_DAYS_AGO = N_DAYS_AGO*30
        elif day_week_month_year.startswith("year"):
            N_DAYS_AGO = N_DAYS_AGO*365
        else:
            N_DAYS_AGO = None # unsupported unit: timedelta() below raises and None is returned

        posted_date = date_scraping - datetime.timedelta(days=N_DAYS_AGO)
    except Exception:
        posted_date = None

    return posted_date
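
As a worked example of this arithmetic, here is a stand-alone, table-driven sketch of the same conversion. `approx_posted_date` and `DAYS_PER_UNIT` are hypothetical names used for illustration only; the notebook itself uses the function above.

```python
import datetime

# Approximate day counts per unit, matching the conversion described above.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def approx_posted_date(posted_time_ago, date_scraping):
    """Return the estimated posting date, or None if it cannot be parsed."""
    try:
        count, unit = posted_time_ago.split()[:2]
        count = int(count)
    except (AttributeError, ValueError):
        return None  # missing value or unexpected format
    for name, days in DAYS_PER_UNIT.items():
        if unit.startswith(name):
            return date_scraping - datetime.timedelta(days=count * days)
    return None  # other units (e.g. hours) are not handled here


scraped = datetime.date(2024, 4, 21)
print(approx_posted_date("1 week ago", scraped))   # 2024-04-14
print(approx_posted_date("1 month ago", scraped))  # 2024-03-22
```

Because a month is approximated as 30 days, the resulting dates are estimates rather than exact posting dates.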

In [18]:
jobs_DF['scraping_date'] = pd.to_datetime(datetime.date.today())
jobs_DF['posted_date'] = np.vectorize(get_posted_date)(jobs_DF['posted-time-ago'], jobs_DF['scraping_date'])

jobs_DF['Job_txt'] = jobs_DF['Job_txt'].apply(clean_Job_description)
jobs_DF.level = jobs_DF.level.apply(lambda x:x.replace("Employment type\n        \n\n          ","") if x is not None else x)

jobs_DF.head()

| | Job_ID | Job_txt | company | job-title | level | location | posted-time-ago | nb_candidats | scraping_date | posted_date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3885140325 | Data Scientist - Commodities Selby Jennings Si... | Selby Jennings | Data Scientist - Commodities | Mid-Senior level | Singapore, Singapore | 1 week ago | | 2024-04-21 | 2024-04-14 |
| 1 | 3617503376 | Data Scientist (Algo), Paid Ads TikTok Singapo... | TikTok | Data Scientist (Algo), Paid Ads | Not Applicable | Singapore | 1 month ago | | 2024-04-21 | 2024-03-22 |
| 2 | 3616276879 | | | | | | | | 2024-04-21 | NaT |
| 3 | 3885132588 | Senior Data Scientist SixSense Singapore 1 wee... | SixSense | Senior Data Scientist | Mid-Senior level | Singapore | 1 week ago | | 2024-04-21 | 2024-04-14 |
| 4 | 3547792602 | | | | | | | | 2024-04-21 | NaT |


## Save to JSON and CSV files

In [19]:
jobs_DF.to_json("dataScientist_scraped.json")
jobs_DF.to_csv("dataScientist_scraped.csv")