# Scrape LinkedIn Using Selenium, Requests and Beautiful Soup in Python

We are going to scrape LinkedIn job postings. More specifically, the following details will be scraped:
- Job ID
- Job title
- Seniority level
- Location
- Job description
- Number of candidates
- Posted time ago

1. To scrape Job IDs, we will use `selenium` to navigate to this URL: `https://www.linkedin.com/jobs/search?`.\
   The `chromedriver` executable and your LinkedIn credentials are required here.

2. As explained [here](https://www.scrapingdog.com/blog/scrape-linkedin-jobs/), to scrape the other details (level, description...), we will use a simple GET request (leveraging the `requests` library) to this URL: `https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/xxxxxx`, where xxxxxx is the job ID.\
   This is easier than using clicks from `selenium`.

Note that I tried to scrape Job IDs using the guest URL `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?` but the results were imprecise.
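As a small illustration of the guest URL pattern described above, the sketch below fills the job ID into the URL template. The helper name `build_job_url` is hypothetical and only used here for illustration; the ID is one of the job IDs scraped later in this notebook.

```python
# URL template of the guest job-posting endpoint mentioned above
JOB_POSTING_URL = "https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{}"


def build_job_url(job_id):
    """Return the guest job-posting URL for a given job ID (hypothetical helper)."""
    return JOB_POSTING_URL.format(job_id)


print(build_job_url(3768445795))
```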

# I-Scraping LinkedIn Job IDs using selenium and BeautifulSoup

In [6]:
# pip install selenium 
# pip install beautifulsoup4

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import requests

import time, datetime
import pandas as pd
import numpy as np
import math, re, sys
import warnings
warnings.filterwarnings("ignore")

## Log in to LinkedIn using selenium

🔑 **Note:**

1. To use `selenium`, we need a web driver. For instance, the `chromedriver` can be downloaded from [here](https://chromedriver.chromium.org/downloads).\
   Next, we add the chromedriver to the project directory.

2. LinkedIn credentials (email address and password) are also required. You can save them here: `../data/user_credentials.txt`
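A minimal sketch of reading such a credentials file, assuming it holds the email address on the first line and the password on the second (the layout the next cell relies on). The helper name `read_credentials` is hypothetical; unlike the cell below, it fails loudly when a line is missing.

```python
from pathlib import Path


def read_credentials(path):
    """Return (email, password) from a two-line text file (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Ignore blank lines and surrounding whitespace
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected an email line followed by a password line")
    return lines[0], lines[1]
```

It could be called as `my_email, my_pwd = read_credentials('../data/user_credentials.txt')`.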

In [7]:
# Get User Credentials
with open('../data/user_credentials.txt', 'r',encoding="utf-8") as file:
    user_credentials = file.readlines()
    user_credentials = [line.rstrip() for line in user_credentials]
    
my_email,my_pwd = user_credentials[0],user_credentials[1]
my_email,my_pwd

('email_address', 'password')

In [None]:
# 1. Instantiate the chrome service
chromedriver_path = '../chromedriver/chromedriver.exe'
service = Service(executable_path=chromedriver_path)

# 2. Instantiate the webdriver
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options, service=service)

# 3. Open the LinkedIn login page
driver.get('https://www.linkedin.com/login')
time.sleep(5) # waiting for the page to load

# 4. Enter email address & password
email_input = driver.find_element(By.ID, 'username')
password_input = driver.find_element(By.ID, 'password')
email_input.send_keys(my_email)
password_input.send_keys(my_pwd)

# 5. Click the login button
password_input.send_keys(Keys.ENTER)

time.sleep(10)

We will be logged into LinkedIn after running the above code.

## Scraping LinkedIn Job IDs

1. Set the search query parameters: keywords (i.e. the job title) and location;
2. Search results are spread across several pages: `25` jobs are listed on each page;
3. We will navigate to every page using the `start` parameter (0, 25, 50...);
4. We need to scroll to the bottom of each page to load the full data;
5. To get the Job IDs, we will parse the HTML content of the page using BeautifulSoup.
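The pagination in steps 1 to 3 can be sketched with the standard library alone. This is only an illustration under the assumptions stated in the list (25 jobs per page, a `start` offset); the helper names `build_search_url` and `page_starts` are hypothetical, and the cells below build the URL with an f-string instead.

```python
import math
from urllib.parse import quote, urlencode

SEARCH_URL = "https://www.linkedin.com/jobs/search/"
JOBS_PER_PAGE = 25  # page size stated in the steps above


def build_search_url(keywords, location, start=0):
    """Build the search URL, percent-encoding spaces and commas."""
    query = urlencode(
        {"keywords": keywords, "location": location, "start": start},
        quote_via=quote,
    )
    return f"{SEARCH_URL}?{query}"


def page_starts(number_of_jobs):
    """Return the `start` offset of every result page: 0, 25, 50, ..."""
    number_of_pages = math.ceil(number_of_jobs / JOBS_PER_PAGE)
    return [JOBS_PER_PAGE * page for page in range(number_of_pages)]


print(build_search_url("data scientist", "Montreal, Quebec, Canada", start=25))
print(page_starts(60))  # -> [0, 25, 50]
```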

In [4]:
List_Job_IDs = []

In [5]:
# Create a function that scrolls to the bottom of the page.

# time.sleep() gives the webpage extra time to load.
# I used 120 seconds. If the 25 jobs have not loaded during this period, we can adjust it and test again.

def scroll_to_bottom(driver,sleep_time=120):
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
    
    time.sleep(sleep_time)  
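A possible variant, offered only as a sketch: instead of one long wait after the loop, pause briefly after every scroll so that lazily loaded content has time to change the page height before it is compared. The function name `scroll_until_stable` and the `pause` and `max_rounds` values are assumptions, not tested against LinkedIn; it only relies on the `execute_script` method already used above.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=30):
    """Scroll down repeatedly until the page height stops growing.

    `driver` is any object exposing `execute_script`, such as a selenium webdriver.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the bottom
        last_height = new_height
    return last_height
```

The `max_rounds` cap keeps the loop from running forever on a page that keeps growing.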

In [None]:
# Navigate to the first page (start=0) and scroll to the bottom of the page

keywords = 'data%20scientist'
location = 'Montreal%2C%20Quebec%2C%20Canada'
start = 0

url = f'https://www.linkedin.com/jobs/search/?keywords={keywords}&location={location}&start={start}'
url = requests.utils.requote_uri(url)
driver.get(url)
scroll_to_bottom(driver,sleep_time=120)

In [11]:
# Get number of jobs found and number of pages:

# Parse the HTML content of the page using BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'html.parser')

try:
    div_number_of_jobs = soup.find("div",{"class":"jobs-search-results-list__subtitle"})
    number_of_jobs = int(div_number_of_jobs.find('span').get_text().strip().split()[0])
except:
    number_of_jobs = 0
    
number_of_pages=math.ceil(number_of_jobs/25)
print("number_of_jobs:",number_of_jobs)
print("number_of_pages:",number_of_pages)

number_of_jobs: 260
number_of_pages: 11
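The arithmetic behind this output is `math.ceil(260 / 25) = 11`. As a hedged sketch, the count could also be extracted with a regular expression so that a caption containing a thousands separator does not fall back to 0. The caption strings below are hypothetical examples, not text copied from LinkedIn, and `parse_job_count` is an illustrative name.

```python
import math
import re


def parse_job_count(text):
    """Return the first integer found in a results caption, or 0 if there is none."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else 0


number_of_jobs = parse_job_count("260 results")  # hypothetical caption
print(number_of_jobs, math.ceil(number_of_jobs / 25))  # -> 260 11
```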


In [12]:
# Get Job Ids present on the first page.

def find_Job_Ids(soup):

    Job_Ids_on_the_page = []
    
    job_postings = soup.find_all('li', {'class': 'jobs-search-results__list-item'})
    for job_posting in job_postings:
        Job_ID = job_posting.get('data-occludable-job-id')
        Job_Ids_on_the_page.append(Job_ID)
        # job_title = job_posting.find('a', class_='job-card-list__title').get_text().strip()
        # location = job_posting.find('li', class_='job-card-container__metadata-item').get_text().strip()
    
    return Job_Ids_on_the_page    

Jobs_on_this_page = find_Job_Ids(soup)
List_Job_IDs.extend(Jobs_on_this_page)
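Note that `job_posting.get(...)` returns `None` when a list item has no `data-occludable-job-id` attribute, so the collected list may contain empty entries. The sketch below, with the hypothetical helper `clean_job_ids`, drops those entries and removes duplicates while keeping the original order; whether duplicates actually occur across pages is an assumption.

```python
def clean_job_ids(job_ids):
    """Drop missing IDs and duplicates, preserving first-seen order."""
    # dict keys keep insertion order, which makes this an ordered de-duplication
    return list(dict.fromkeys(str(job_id) for job_id in job_ids if job_id))


print(clean_job_ids(["3768445795", None, "3766874608", "3768445795"]))
# -> ['3768445795', '3766874608']
```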

Now that we've scraped the job IDs and number of results from the first page, let's iterate over the remaining pages.

### Iterate over the remaining pages

In [14]:
if number_of_pages>1:
    
    for page_num in range(1,number_of_pages):
        print(f"Scraping page: {page_num}",end="...")
        
        # Navigate to page
        url = f'https://www.linkedin.com/jobs/search/?keywords={keywords}&location={location}&start={25 * page_num}'
        url = requests.utils.requote_uri(url)
        driver.get(url)
        scroll_to_bottom(driver)

        # Parse the HTML content of the page using BeautifulSoup.
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Get Job Ids present on the page.
        Jobs_on_this_page = find_Job_Ids(soup)
        List_Job_IDs.extend(Jobs_on_this_page)  
        print(f'Jobs found:{len(Jobs_on_this_page)}')

pd.DataFrame({"Job_Id":List_Job_IDs}).to_csv('../data/Job_Ids.csv',index=False)

Scraping page: 1...Jobs found:25
Scraping page: 2...Jobs found:25
Scraping page: 3...Jobs found:25
Scraping page: 4...Jobs found:25
Scraping page: 5...Jobs found:25
Scraping page: 6...Jobs found:25
Scraping page: 7...Jobs found:25
Scraping page: 8...Jobs found:25
Scraping page: 9...Jobs found:25
Scraping page: 10...Jobs found:8


In [15]:
# Close the browser and shut down the chromedriver executable
# that was started together with the webdriver.
driver.quit()

## Scraping Job descriptions using requests and BeautifulSoup

This part follows the approach described [here](https://www.scrapingdog.com/blog/scrape-linkedin-jobs/).

In [3]:
import requests
from bs4 import BeautifulSoup

list_job_IDs = pd.read_csv("../data/Job_Ids.csv").Job_Id.to_list()

In [26]:
list_job_IDs = list_job_IDs[:5]
list_job_IDs

[3768445795, 3766874608, 3636842773, 3765556140, 3743029150]

In [2]:
def remove_tags(html):
    '''Return the visible text of an HTML document, without tags, styles or scripts.'''
 
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
 
    for data in soup(['style', 'script']):
        # Drop <style> and <script> elements together with their content
        data.decompose()
 
    # join the remaining text fragments with single spaces
    return ' '.join(soup.stripped_strings)

In [37]:
job_url='https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{}'
job={}
list_jobs=[]

for j in range(0,len(list_job_IDs)):
    print(f"{j+1} ... read jobId:{list_job_IDs[j]}")

    resp = requests.get(job_url.format(list_job_IDs[j]))
    soup=BeautifulSoup(resp.text,'html.parser')
    # print(soup.prettify()) 

    job["Job_ID"] = list_job_IDs[j] 
    # try:
    #     job["Job_html"] = resp.content
    # except:
    #     job["Job_html"]=None

    try: # remove tags
        job["Job_txt"] = remove_tags(resp.content)
    except:
        job["Job_txt"] = None
    
    try:
        job["company"]=soup.find("div",{"class":"top-card-layout__card"}).find("a").find("img").get('alt')
    except:
        job["company"]=None

    try:
        job["job-title"]=soup.find("div",{"class":"top-card-layout__entity-info"}).find("a").text.strip()
    except:
        job["job-title"]=None

    try:
        job["level"]=soup.find("ul",{"class":"description__job-criteria-list"}).find("li").text.replace("Seniority level","").strip()
    except:
        job["level"]=None

    try:
        job["location"]=soup.find("span",{"class":"topcard__flavor topcard__flavor--bullet"}).text.strip()
    except:
        job["location"]=None

    try:
        job["posted-time-ago"]=soup.find("span",{"class":"posted-time-ago__text topcard__flavor--metadata"}).text.strip()
    except:
        job["posted-time-ago"]=None

    try:
        nb_candidats = soup.find("span",{"class":"num-applicants__caption topcard__flavor--metadata topcard__flavor--bullet"}).text.strip()
        nb_candidats = int(nb_candidats.split()[0])
        job["nb_candidats"]= nb_candidats
    except:
        job["nb_candidats"]=None

    list_jobs.append(job)
    job={}

# create a pandas DataFrame
jobs_DF = pd.DataFrame(list_jobs)
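In the loop above, `int(nb_candidats.split()[0])` only succeeds when the caption starts with a number; otherwise the value falls back to `None`. A hedged alternative is to search for the first integer anywhere in the caption. The caption strings below are hypothetical examples, and `parse_applicant_count` is an illustrative name.

```python
import re


def parse_applicant_count(caption):
    """Return the first integer found in an applicants caption, or None."""
    if not caption:
        return None
    match = re.search(r"\d+", caption.replace(",", ""))
    return int(match.group()) if match else None


for caption in ["70 applicants", "Over 200 applicants", "Be among the first 25 applicants"]:
    print(caption, "->", parse_applicant_count(caption))
```

Keep in mind that a caption such as the last one gives an upper bound rather than an exact count, so such values may deserve a separate flag.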

1 ... read jobId:3768445795
2 ... read jobId:3766874608
3 ... read jobId:3636842773
4 ... read jobId:3765556140
5 ... read jobId:3743029150
6 ... read jobId:3765912486
7 ... read jobId:3755769264
8 ... read jobId:3738828224
9 ... read jobId:3717903353
10 ... read jobId:3766879037
11 ... read jobId:3765709395
12 ... read jobId:3676014327
13 ... read jobId:3771807918
14 ... read jobId:3768668734
15 ... read jobId:3770793104
16 ... read jobId:3768182240
17 ... read jobId:3768141217
18 ... read jobId:3713157521
19 ... read jobId:3771365625
20 ... read jobId:3765392730
21 ... read jobId:3645689495
22 ... read jobId:3759367245
23 ... read jobId:3772194033
24 ... read jobId:3764345835
25 ... read jobId:3711769725
26 ... read jobId:3706108079
27 ... read jobId:3762359345
28 ... read jobId:3709558547
29 ... read jobId:3764677184
30 ... read jobId:3759706595
31 ... read jobId:3717081904
32 ... read jobId:3684313985
33 ... read jobId:3760886556
34 ... read jobId:3737836614
35 ... read jobId:37605

In [40]:
jobs_DF.head()

| | Job_ID | Job_txt | company | job-title | level | location | posted-time-ago | nb_candidats |
|---|---|---|---|---|---|---|---|---|
| 0 | 3768445795 | Data Scientist / Scientifique des Données Brai... | BrainFinance | Data Scientist / Scientifique des Données | Mid-Senior level | Montreal, Quebec, Canada | 6 days ago | |
| 1 | 3766874608 | Data Scientist Ubisoft Montreal, Quebec, Canad... | Ubisoft | Data Scientist | Mid-Senior level | Montreal, Quebec, Canada | | 70.0 |
| 2 | 3636842773 | Data Scientist / Senior Data Scientist StackAd... | StackAdapt | Data Scientist / Senior Data Scientist | Mid-Senior level | Canada | 1 week ago | |
| 3 | 3765556140 | Data Scientist / Scientifique des données McGi... | McGill St Laurent | Data Scientist / Scientifique des données | Mid-Senior level | Montreal, Quebec, Canada | 1 week ago | |
| 4 | 3743029150 | Scientifique en IA Appliqué / AI Research Scie... | Thales | Scientifique en IA Appliqué / AI Research Scie... | Not Applicable | Montreal, Quebec, Canada | 1 week ago | 97.0 |


🔑 **Note:**

Now we have scraped all the LinkedIn job details.

The next step is to process the data:
1. Create a `posted_date` column from `posted-time-ago`;
2. Clean up `Job_txt` (remove sentences like "Remove photo First name Last name Email Password (8+ characters)");
3. Clean up the `level` column.

## Process data

In [69]:
def clean_Job_description(text):
    """Remove LinkedIn sign-in and cookie boilerplate from the scraped job text."""
    if text is None:  # the job page could not be parsed
        return text

    sentences_to_remove = ["Remove photo First name Last name Email Password (8+ characters) ",
                           "By clicking Agree & Join",
                           "you agree to the LinkedIn User Agreement",
                           "Privacy Policy and Cookie Policy",
                           "Continue Agree & Join or Apply on company website",
                           "Security verification",
                           "Close Already on LinkedIn ?",
                           "Close Already on LinkedIn?",
                           "Sign in Save Save job Save this job with your existing LinkedIn profile , or create a new one",
                           "Sign in Save Save job Save this job with your existing LinkedIn profile, or create a new one",
                           "Your job seeking activity is only visible to you",
                           "Email Continue Welcome back"]
    for sentence in sentences_to_remove:
        result = text.find(sentence)
        if result>-1:
            text = text[:result] + text[result+len(sentence):] # remove the first occurrence of the sentence

    return text 
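The function above removes only the first occurrence of each sentence. If a phrase can appear several times in a page (an assumption, not something verified here), a variant based on `str.replace` removes every occurrence and then collapses the leftover whitespace. The name `remove_phrases` is hypothetical.

```python
def remove_phrases(text, phrases):
    """Remove every occurrence of each phrase and normalise whitespace."""
    if text is None:
        return None
    for phrase in phrases:
        text = text.replace(phrase, " ")
    # split/join collapses the runs of spaces left behind by the removals
    return " ".join(text.split())


sample = "Data Scientist Security verification Montreal Security verification"
print(remove_phrases(sample, ["Security verification"]))  # -> Data Scientist Montreal
```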

In [38]:
def get_posted_date(posted_time_ago,date_scraping):
    """Convert posted_time_ago to a posting date.
    The text is first converted to a number of days (for example, 1 month ago
    becomes 30 and 1 week ago becomes 7), which is then subtracted from date_scraping."""
    posted_date = None
    
    try:
        details = posted_time_ago.split()
        N_DAYS_AGO = int(details[0])
        day_week_month_year = details[1] 
        if day_week_month_year.startswith("day"):
            N_DAYS_AGO = N_DAYS_AGO
        elif day_week_month_year.startswith("week"):
            N_DAYS_AGO = N_DAYS_AGO*7
        elif day_week_month_year.startswith("month"):
            N_DAYS_AGO = N_DAYS_AGO*30
        elif day_week_month_year.startswith("year"):
            N_DAYS_AGO = N_DAYS_AGO*365
        else:
            N_DAYS_AGO = None

        posted_date = date_scraping - datetime.timedelta(days=N_DAYS_AGO)
    except:
        posted_date = None

    return posted_date
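The same conversion can be written as a lookup table, shown here as a sketch with the hypothetical helper `days_ago`. One deliberate difference from the function above: "hours" and "minutes" are mapped to 0 days (same-day postings), which is an assumption; the function above returns `None` for them.

```python
import datetime

# Approximate number of days per unit, as in the function above
DAYS_PER_UNIT = {"minute": 0, "hour": 0, "day": 1, "week": 7, "month": 30, "year": 365}


def days_ago(posted_time_ago):
    """Convert text such as '2 weeks ago' to a number of days, or None."""
    try:
        count, unit = posted_time_ago.split()[:2]
        count = int(count)
    except (AttributeError, ValueError):
        return None  # missing value or unexpected format
    for name, days in DAYS_PER_UNIT.items():
        if unit.startswith(name):
            return count * days
    return None


scraping_date = datetime.date(2023, 11, 25)
print(scraping_date - datetime.timedelta(days=days_ago("2 days ago")))  # -> 2023-11-23
```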

In [40]:
jobs_DF['scraping_date'] = pd.to_datetime(datetime.date.today())
jobs_DF['posted_date'] = np.vectorize(get_posted_date)(jobs_DF['posted-time-ago'], jobs_DF['scraping_date'])

jobs_DF['Job_txt'] = jobs_DF['Job_txt'].apply(clean_Job_description)
jobs_DF.level = jobs_DF.level.apply(lambda x:x.replace("Employment type\n        \n\n          ","") if x is not None else x)

jobs_DF.head()

| | Job_ID | Job_txt | company | job-title | level | location | posted-time-ago | nb_candidats | scraping_date | posted_date | skills | match_score | missing_skills |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 3766879037 | Scientifique des données- Opérations Ubisoft M... | Ubisoft | Scientifique des données- Opérations | Mid-Senior level | Montreal, Quebec, Canada | 2 days ago | 29.0 | 2023-11-25 | 2023-11-23 | [python, marketing, security, sql, business, d... | 75.0 | security,collaboration |
| 207 | 3762695166 | Directeur.trice du département science des don... | ChrysaLabs | Directeur.trice du département science des don... | Director | Montreal, Quebec, Canada | 4 days ago | 25.0 | 2023-11-25 | 2023-11-21 | [python, sql, collaboration, marketing] | 75.0 | collaboration |
| 230 | 3766203603 | Director, Client Partner- EN RBC Montreal, Que... | RBC | Director, Client Partner- EN | Not Applicable | Montreal, Quebec, Canada | | | 2023-11-25 | NaT | [analytics, business, security, support] | 75.0 | security |
| 3 | 3766874608 | Data Scientist Ubisoft Montreal, Quebec, Canad... | Ubisoft | Data Scientist | Mid-Senior level | Montreal, Quebec, Canada | 2 days ago | 96.0 | 2023-11-25 | 2023-11-23 | [database, machine learning, databases, market... | 69.2 | databases,security,visualization,segment,software |
| 103 | 3767691220 | Data Analyst Logikk Montreal, Quebec, Canada 1... | Logikk | Data Analyst | Mid-Senior level | Montreal, Quebec, Canada | 1 day ago | | 2023-11-25 | 2023-11-24 | [python, machine learning, analytics, computer... | 66.7 | computer vision,play,ai |


## Save to JSON file

In [41]:
jobs_DF.to_json("../data/linkedin_jobs_scraped.json")