**Table of contents**<a id='toc0_'></a>    
- [Web Scraping Tools](#toc1_)    
- [Selenium](#toc2_)    
  - [Case study: Scraping LinkedIn job posts](#toc2_1_)    
    - [Install web driver](#toc2_1_1_)    
    - [Log into LinkedIn](#toc2_1_2_)    
    - [Find job position](#toc2_1_3_)    
      - [What is the job position you want to search for?](#toc2_1_3_1_)    
      - [What is the job location you want to search for?](#toc2_1_3_2_)    
      - [Can we find what we need from the HTML?](#toc2_1_3_3_)    
      - [Loop through the available pages](#toc2_1_3_4_)    
  - [Extra: Do the scraping using Selenium](#toc2_2_)    
  - [Extra: Save cookies in a pickle 🥒](#toc2_3_)    
- [References/Acknowledgments](#toc3_)    


# <a id='toc1_'></a>[Web Scraping Tools](#toc0_)

Some of the main tools used for web scraping in Python include:
- [`requests`](https://requests.readthedocs.io/en/latest/) - lets you send HTTP requests through functions named after the HTTP methods, e.g. `get`, `post`, etc. It's the starting point for most web scraping projects. However, it has 2 drawbacks: it only retrieves the HTML the server sends and does not run JavaScript, so it can only scrape **static** content, and it sends **synchronous** requests. This means that it doesn't work well on JavaScript-heavy pages (i.e. pages with a lot of dynamic content, like `Airbnb`) and it becomes very slow if you want to send a large number of requests.
- `BeautifulSoup` - allows you to extract information from HTML pages using the HTML/CSS structural elements, i.e. tags and attributes.
- `Scrapy` - automates web scraping by providing some of the typical structures for extracting information from websites. It is **asynchronous** and widely used for large-scale scraping projects. Drawbacks: It runs on **static** HTML pages and it requires a decent understanding of object-oriented programming.
- `Selenium` - automates a real web browser, which makes it possible to scrape JavaScript-heavy websites. Drawbacks: It can be slow on its own, so it's typically used together with `requests`, `BeautifulSoup`, and/or `Scrapy`.
- `aiohttp` - the **asynchronous** cousin of `requests`. It has mostly the same functionality, but it doesn't wait for the server's response to one request before sending the next one, which is what makes it asynchronous. To understand how asynchronous programming works, I highly recommend this [blog post series on the `asyncio` library](https://bbc.github.io/cloudfit-public-docs/asyncio/asyncio-part-1). Please read this **after** the bootcamp though, you likely won't need it now.

**Note:** Remember, before scraping any website (and especially a big one), check whether there's an API available that you can use instead!
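
To make the synchronous vs. asynchronous distinction from the list above more concrete, here is a minimal sketch that uses only the standard library. It does not call `requests` or `aiohttp`: `fake_request_sync` and `fake_request_async` are hypothetical stand-ins that simply wait for `delay` seconds, as if a server were taking that long to respond. All function names here are made up for illustration.

```python
import asyncio
import time


def fake_request_sync(delay):
    """Hypothetical stand-in for a blocking call such as requests.get()."""
    time.sleep(delay)
    return delay


async def fake_request_async(delay):
    """Hypothetical stand-in for an awaitable call such as an aiohttp request."""
    await asyncio.sleep(delay)
    return delay


def fetch_all_sync(delays):
    # Each "request" must finish before the next one starts
    return [fake_request_sync(d) for d in delays]


async def fetch_all_async(delays):
    # All "requests" wait at the same time; results keep the input order
    return await asyncio.gather(*(fake_request_async(d) for d in delays))


def run_async(delays):
    # asyncio.run starts an event loop, runs the coroutine and closes the loop
    return asyncio.run(fetch_all_async(delays))


def timed(func, *args):
    """Return (result, elapsed wall-clock seconds) for func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2, 0.2, 0.2]
    _, sync_seconds = timed(fetch_all_sync, delays)
    _, async_seconds = timed(run_async, delays)
    print(f"sync:  {sync_seconds:.2f}s")
    print(f"async: {async_seconds:.2f}s")
```

Run this as a script: the synchronous version takes roughly the sum of the delays, the asynchronous one roughly the longest single delay. Inside a notebook cell an event loop is already running, so `asyncio.run` raises an error there; use `await fetch_all_async(delays)` instead.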

# <a id='toc2_'></a>[Selenium](#toc0_)

> Selenium is an open-source framework **widely used for testing web applications**. It empowers developers and testers to automate interactions with web applications, such as clicking buttons, filling forms, and navigating pages, mimicking user behavior. It supports interaction with complex web elements and dynamic content, making it suitable for modern web applications. 

(courtesy of ChatGPT)

## <a id='toc2_1_'></a>[Case study: Scraping LinkedIn job posts](#toc0_)

<iframe src="https://giphy.com/embed/dgg13lkNAUa5eibLiY" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/dgg13lkNAUa5eibLiY">via GIPHY</a></p>

In [None]:
# You know the drill
# !pip install selenium
# !pip install webdriver_manager

In [1]:
# time - used to create breaks between requests 
import time

# getpass - to input our password without showing it in the notebook
from getpass import getpass

# Juicy stuff - these are the Classes we will use for interaction with a webpage:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common import NoSuchElementException, ElementClickInterceptedException

# libraries for interacting with the operating system (OS)
import pathlib
import os
from os.path import join

import pandas as pd
import random
import re
from bs4 import BeautifulSoup

# Ignore warning -- Some methods are going to be deprecated 
import warnings
warnings.filterwarnings('ignore')

### <a id='toc2_1_1_'></a>[Install web driver](#toc0_)

In [2]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

### <a id='toc2_1_2_'></a>[Log into LinkedIn](#toc0_)

In [3]:
# open the website
driver.get('https://www.linkedin.com/login/')

In [4]:
# Add email
email = input('Enter your email: ')

In [6]:
# Find email box
email_box = driver.find_element(By.ID, "username")

In [7]:
# Clear email box
email_box.clear()

In [8]:
# Input email into browser
email_box.send_keys(email)

In [9]:
# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [10]:
# Add password
password = getpass('Enter your password: ')

# Find password box
pass_box = driver.find_element(By.ID, "password")

# Clear password box
pass_box.clear()

# Input password into browser
pass_box.send_keys(password)

# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [11]:
# Find and click on the log-in button
login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
login.click()
time.sleep(random.random() * 3)

In [15]:
# Add exception handling
try:
    login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
    login.click()
    time.sleep(random.random() * 3)
except NoSuchElementException:
    print("Log-in already done!")
except Exception as e:
    raise e

Log-in already done!
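
The `time.sleep(random.random() * 3)` pattern shows up after almost every action in this notebook, and it can produce a pause very close to zero. A small helper can keep the pauses in one place and guarantee a minimum. This is only a sketch: the name `human_pause` and its default values are my own assumptions, not something Selenium provides.

```python
import random
import time


def human_pause(min_seconds=0.5, max_seconds=3.0):
    """Sleep for a random duration between min_seconds and max_seconds.

    Returns the duration that was slept, which is handy for logging.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    duration = random.uniform(min_seconds, max_seconds)
    time.sleep(duration)
    return duration


# Example: pause briefly, as you might between two browser actions
slept = human_pause(0.0, 0.05)
print(f"paused for {slept:.3f}s")
```

With this in place, each `time.sleep(random.random() * 3)` line could become a call to `human_pause()`.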


### <a id='toc2_1_3_'></a>[Find job position](#toc0_)

In [22]:
# Go to job search bar
try:
    job_icon = driver.find_element(By.CSS_SELECTOR, "span[title='Jobs']")
    job_icon.click()
    time.sleep(random.random() * 3)
except ElementClickInterceptedException:
    print("Another element is covering the Jobs icon. Try zooming in or out, or resizing the window")
except Exception as e:
    print(repr(e))

In [17]:
# Zooming in
driver.execute_script("document.body.style.zoom='200%'")

In [18]:
# Zooming out
driver.execute_script("document.body.style.zoom='67%'")

In [19]:
driver.maximize_window()
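
The manual routine above (try the click, adjust the window, try again) can be wrapped in a small helper. The sketch below is plain Python and knows nothing about Selenium: `click_with_fallbacks` is a hypothetical name, and the callables you pass in decide what "click" and "fix" mean.

```python
import time


def click_with_fallbacks(click, fixes, recoverable=(Exception,), pause=0.0):
    """Try click(); after each recoverable failure apply the next fix and retry.

    click       -- zero-argument callable that performs the click
    fixes       -- list of zero-argument callables (e.g. zoom out, maximize)
    recoverable -- tuple of exception types that should trigger a retry
    pause       -- seconds to wait after applying a fix
    Returns the number of fixes applied before the click succeeded.
    """
    applied = 0
    while True:
        try:
            click()
            return applied
        except recoverable:
            if applied == len(fixes):
                raise  # nothing left to try, so surface the original error
            fixes[applied]()
            applied += 1
            time.sleep(pause)


# With the objects from this notebook it might be used like this
# (an untested assumption about the live page, shown only as a comment):
# click_with_fallbacks(
#     click=lambda: driver.find_element(By.CSS_SELECTOR, "span[title='Jobs']").click(),
#     fixes=[driver.maximize_window,
#            lambda: driver.execute_script("document.body.style.zoom='67%'")],
#     recoverable=(ElementClickInterceptedException,),
#     pause=1.0,
# )
```

Passing the fixes as a list keeps the order of attempts explicit, and the original exception is re-raised once every fix has been tried.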

#### <a id='toc2_1_3_1_'></a>[What is the job position you want to search for?](#toc0_)

In [None]:
# Optional - Change window size
# driver.set_window_size(800, 600)

In [23]:
search_job = driver.find_elements(By.CLASS_NAME,'jobs-search-box__text-input')[0] 
job = input('What job do you want to search for: ')
search_job.clear()
search_job.send_keys(job)
time.sleep(random.random() * 3)

# Go to the location tab
search_job.send_keys(Keys.TAB)

#### <a id='toc2_1_3_2_'></a>[What is the job location you want to search for?](#toc0_)

In [24]:
location_box = driver.switch_to.active_element
location = input('Where do you want to search for jobs: ')
location_box.send_keys(location)
time.sleep(random.random() * 3)

In [25]:
# Now let's search
location_box.send_keys(Keys.ENTER)

**Note:** In this exercise I keep the browser window and VSCode open side by side. If you switch from one window to the other, the same strategy won't work.  

**Why?** Because LinkedIn closes the location search bar as soon as you switch to VSCode. 

**Why do they do that?** Probably to get rid of us :(

In [26]:
# Maximize the window - useful to see all the elements as the page is dynamic
driver.maximize_window()

In [None]:
## Optional: you can also fullscreen the window
# driver.fullscreen_window()

#### <a id='toc2_1_3_3_'></a>[Can we find what we need from the HTML?](#toc0_)

As mentioned previously, Selenium can be quite slow, so we always want to check whether we can fetch our data directly using static web scraping tools (i.e. `requests`, `BeautifulSoup`, `Scrapy`):

In [33]:
# Check if the source code contains the job listings
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly instead of letting bs4 guess
soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})

[<a aria-label="Game Data Analyst (f/m/d)" class="disabled ember-view job-card-container__link job-card-list__title" data-control-id="omoIfEZxOA4f7faYk4/eWQ==" href="/jobs/view/3697113994/?eBP=CwEAAAGK0oLuVJ4ZtRsrff_Nri7Jq13we8ZWVRIyBZJSVq2iFdnjOhMuJ-6lKt8e0iKgPbBAFshFg4uz4B0RxCnBwxzdOQzqmGWwZASPSulfEMypNKEriwRllq3mrpFMh2D20ehIq0yTONmE9_V4F1SphfCmxkYVgXEMHQxJaTGNakU3hoIHVpUo2elzm4E0acAa0K70TPA7P5tXVHmaHx-UsVNiFl77FeslVyDqILB9ONV6E8gkEQlLbcIcTCSG-y4eRzkV8nwr_Xm-LFM7gX-7DCPGoCYW1WaFjOUAmw6TgC1hcy39kzKus85YSk4PKZ-Uzqo8-UCieP7Jn3_3fh5IOIFvkg3fJQLdKTR-Pz1kO7m_MS8xYCz2gGNTyH5rXs81wwuW&amp;refId=ctcsc6TNUEVZm4C0mLCV7g%3D%3D&amp;trackingId=omoIfEZxOA4f7faYk4%2FeWQ%3D%3D&amp;trk=flagship3_search_srp_jobs" id="ember226" tabindex="0">
                   Game Data Analyst (f/m/d)
                 </a>,
 <a aria-label="Content Data Analyst - Food Science Domain, AI Content Generation - Consumer Data Products (all genders)" class="disabled ember-view job-card-container__link job-card-list__title" da

In [29]:
# Clean the list
job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
job_list_clean = [job.text.strip() for job in job_list_dirty]
job_list_clean

['Game Data Analyst (f/m/d)',
 'Content Data Analyst - Food Science Domain, AI Content Generation - Consumer Data Products (all genders)',
 'Senior Data Analyst (m/w/x) Product Analytics',
 '(Senior) Consultant Financial Advisory - Schwerpunkt Financial Modeling und Analytics (w/m/d)',
 'Data Analyst',
 '(Junior) Consultant | Business Intelligence - Planung & Reporting (m/w/d) in Berlin',
 'Freelance Data Analyst - Trading (m/f/d)',
 'Associate Consultant - SAP S/4HANA Analytics (w/m/x)',
 'Lead Data Analyst - OPEX Analytics - Content Solutions - Lounge by Zalando (all genders)',
 'Senior Consultant Forensic - Data Analytics (m/w/d) in Berlin',
 'Product Data Analyst',
 'IT Architekt / Business Data Analyst (w|m|d)',
 'Product Data Analyst (m/f/d)',
 'Senior/Lead Data Analytics Consultant - Fashion and Luxury Goods (m/w/d)',
 'Consultant, Data & Analytics | Forensic and Litigation Consulting',
 'Data Analyst',
 'Senior Data Analyst',
 '(Senior) Consultant Data & Analytics - Projektmana
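
If you're curious what the class lookup above boils down to, here is a rough sketch that uses only the standard library's `html.parser`. It is not how BeautifulSoup is implemented and it is not a replacement for it. The `TitleCollector` class and the sample HTML are made up for illustration, with class names modeled on the output above. One difference worth knowing: the regex in the cell above matches any class that *contains* `job-card-list__title`, while this sketch only accepts the exact class name.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the stripped text of every <a> tag carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.titles = []
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of class names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.class_name in classes:
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.titles.append("".join(self._chunks).strip())
            self._inside = False


# Made-up HTML snippet for illustration
sample_html = """
<ul>
  <li><a class="disabled ember-view job-card-list__title" href="/jobs/view/1/">
        Game Data Analyst (f/m/d)
      </a></li>
  <li><a class="job-card-list__title" href="/jobs/view/2/">Data Analyst</a></li>
  <li><a class="other-link" href="/help">Help</a></li>
</ul>
"""

collector = TitleCollector("job-card-list__title")
collector.feed(sample_html)
print(collector.titles)
```

The third link is skipped because it does not carry the class, and the whitespace around the first title is stripped, just like `job.text.strip()` does in the cell above.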

In [30]:
# Do the same for the company
job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
job_company_clean = [company.text.strip() for company in job_company_dirty]
job_company_clean

['Kolibri Games',
 'Delivery Hero',
 'sevDesk',
 'PwC Deutschland',
 'Orange Quarter',
 'BearingPoint',
 'Convex Energy',
 'IBM',
 'Zalando',
 'Deloitte',
 'kevin.',
 'zeb consulting',
 'Inkitt',
 'EPAM Systems',
 'FTI Consulting',
 'CrazyLabs',
 'Klarna',
 'adesso SE',
 'MongoDB',
 'AOK-Bundesverband',
 'Springer Nature Group',
 'Personio',
 'SellerX',
 'Computer Futures',
 'AERTiCKET Gruppe',
 '208,131 followers']

In [31]:
# Make it into a dataset
data = zip(job_list_clean, job_company_clean)
df = pd.DataFrame(data, columns=['Job', 'Company'])
df

Unnamed: 0,Job,Company
0,Game Data Analyst (f/m/d),Kolibri Games
1,"Content Data Analyst - Food Science Domain, AI...",Delivery Hero
2,Senior Data Analyst (m/w/x) Product Analytics,sevDesk
3,(Senior) Consultant Financial Advisory - Schwe...,PwC Deutschland
4,Data Analyst,Orange Quarter
5,(Junior) Consultant | Business Intelligence - ...,BearingPoint
6,Freelance Data Analyst - Trading (m/f/d),Convex Energy
7,Associate Consultant - SAP S/4HANA Analytics (...,IBM
8,Lead Data Analyst - OPEX Analytics - Content S...,Zalando
9,Senior Consultant Forensic - Data Analytics (m...,Deloitte
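
One thing to watch out for: `zip` stops at the shorter of its inputs without complaining. The company list above ends with `'208,131 followers'`, which is not a company name, so the two lists may not line up one to one. Below is a minimal sketch of a length check; `pair_up` is a hypothetical helper, and the short lists are a hand-picked subset of the outputs above.

```python
def pair_up(jobs, companies):
    """Pair jobs with companies, refusing to guess when the lengths differ."""
    if len(jobs) != len(companies):
        raise ValueError(
            f"{len(jobs)} jobs but {len(companies)} companies: "
            "check the selectors before building the table"
        )
    return list(zip(jobs, companies))


jobs = ["Game Data Analyst (f/m/d)", "Data Analyst"]
companies = ["Kolibri Games", "Orange Quarter", "208,131 followers"]

# Plain zip drops the unmatched third entry silently
print(list(zip(jobs, companies)))

# The helper makes the mismatch visible instead
try:
    pair_up(jobs, companies)
except ValueError as error:
    print(f"Mismatch detected: {error}")
```

A check like this will not tell you which entry is the odd one out, but it stops a misaligned table from being built unnoticed.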


In [32]:
# Great, let's now create a function out of this:
def get_job_postings(driver, page):

    # Reset the zoom to 100% to ensure all HTML is loaded
    driver.execute_script("document.body.style.zoom='100%'")

    # Go to bottom of page to retrieve all job postings
    page.send_keys(Keys.END)
    page.send_keys(Keys.CONTROL + Keys.HOME) # combination of the two keys brings you to the top of the element

    # Parse HTML (naming the parser explicitly instead of letting bs4 guess)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Get jobs
    job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
    job_list_clean = [job.text.strip() for job in job_list_dirty]

    # Get companies
    job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
    job_company_clean = [company.text.strip() for company in job_company_dirty]

    # Convert data into a dataframe
    data = zip(job_list_clean, job_company_clean)
    return pd.DataFrame(data, columns=['Job', 'Company'])

In [34]:
page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
get_job_postings(driver, page)

Unnamed: 0,Job,Company
0,Game Data Analyst (f/m/d),Kolibri Games
1,"Content Data Analyst - Food Science Domain, AI...",Delivery Hero
2,Senior Data Analyst (m/w/x) Product Analytics,sevDesk
3,(Senior) Consultant Financial Advisory - Schwe...,PwC Deutschland
4,Data Analyst,Orange Quarter
5,Freelance Data Analyst - Trading (m/f/d),Convex Energy
6,(Junior) Consultant | Business Intelligence - ...,BearingPoint
7,Senior Consultant Forensic - Data Analytics (m...,Deloitte
8,Associate Consultant - SAP S/4HANA Analytics (...,IBM
9,BI Entwickler / Analyst (m/w/d),Computer Futures


#### <a id='toc2_1_3_4_'></a>[Loop through the available pages](#toc0_)

In [35]:
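# Get a list with the buttons in the page
def get_buttons(page):
    buttons = []
    # Note: an XPath starting with '//' searches the whole document, not only inside `page`
    for button in page.find_elements(By.XPATH, "//ul/li/button"):
        try:
            int(button.text)
            buttons.append(button)
        except ValueError:
            # Not a page number, so skip it
            pass
    return buttons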

In [36]:
# Get the number of pages to scrape
current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
buttons = get_buttons(current_page)

In [40]:
button_nos = [button.text for button in buttons]
button_no = max(button_nos, key=int)  # compare numerically: as text, '7' would rank above '10'
button_no

'7'
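
Button labels come back from the browser as text, and text is compared character by character, so `'7'` ranks above `'10'`. Here is a small sketch of picking the last page number safely. The `last_page_number` helper and the labels are made up for illustration.

```python
def last_page_number(labels):
    """Return the largest numeric label as an int, ignoring non-numeric ones."""
    numbers = [int(label) for label in labels if label.strip().isdecimal()]
    if not numbers:
        raise ValueError("no numeric page labels found")
    return max(numbers)


labels = ["1", "2", "3", "…", "10"]  # made-up labels for illustration

# Text comparison: prints 3, because "3" sorts after "10"
print(max(label for label in labels if label.isdecimal()))

# Numeric comparison: prints 10
print(last_page_number(labels))
```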

In [44]:
# Loop through pages and save results in a dataframe
df = pd.DataFrame()
driver.execute_script("document.body.style.zoom='100%'")

for i in range(len(buttons)):
    # Printing the button number for debugging purposes
    print(i)
    
    # Extract posts from current page
    current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
    postings = get_job_postings(driver, current_page)
    
    # Refresh button list (if you don't the code will throw an exception.. trust me I spent half an hour debugging it)
    current_buttons = get_buttons(current_page)
    
    # Add to dataframe
    df = pd.concat([df, postings], axis=0)
    
    # Go to the next page
    current_buttons[i].click()

0
1
2
3
4
5
6


In [45]:
# Check dataframe (drop_duplicates returns a new dataframe; assign the result if you want to keep it)
df.drop_duplicates()

Unnamed: 0,Job,Company
0,Senior Consultant/Manager (w/m/d) Business Int...,WTS Deutschland
1,Senior Consultant (m/w/d) IBM Cognos Analytics...,avantum consult GmbH
2,Senior Biostatistician,Allucent
3,Business Intelligence Consultant (m/w/d),CBTW
5,Senior Data Analyst (m/w/d),accantec group
...,...,...
18,SEO - Data Analyst:in,DADAJ
19,Master Data Analyst bei Zeiss (m/w/d) (Job-ID:...,Steinbeis Center of Management and Technology ...
20,Principal Biostatistician,AL Solutions
21,Business Analyst / Data Analyst Logistik (w/m/...,SCI Verkehr GmbH


## <a id='toc2_2_'></a>[Extra: Do the scraping using Selenium](#toc0_)

This bit is to illustrate how slow Selenium can be in comparison to retrieving the HTML for the page:

In [46]:
def page_scraper(job_no): ## add pages

    """ SUMMARY: This function visits the first `job_no` job posts on the current results page and returns a dataset with
    the link, position, company name and job description of each post. It also saves the same dataset to an Excel file.

    HOW IT WORKS: Input the number of jobs you want to scrape. The function finds all the job post links on the page by CSS selector,
    then loops over them: it retrieves the 'href', clicks on the job post and reads the job name, company name and job description.
    Each post is stored as a list, and the list of posts is converted to a dataset at the end.
    The dataset is then written to 'scraped_jobs.xlsx' in the current folder."""

    # For scraper reasons it's required to duplicate the job_no as it retrieves 2 times the same position:
    #job_no = job_no*2

    # empty list for saving the job names , link and extra info:
    job_list = []

    # Reduce the page size in order to be able to find the name of the job in the right section
    driver.execute_script("document.body.style.zoom='67%'")

    # all jobs in the page
    job_raw = driver.find_elements(By.CSS_SELECTOR,"a[class^='disabled ember-view']")

    # go to the end of the page for all the elements to be loaded
    page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
    page.send_keys(Keys.END)
    # go to the top of the page for all the elements to be loaded
    page.send_keys(Keys.CONTROL + Keys.HOME) # combination of the two keys brings you to the top of the element


    for job_index in range(job_no):
        # get the job link
        ref = job_raw[job_index].get_attribute('href')
        time.sleep(random.random() * 3)

        # increase the page size because the inspection for getting the job name was done with the page maximized
        driver.execute_script("document.body.style.zoom='100%'")

        ## let's click on the job post ##
        # driver.find_elements_by_css_selector("a[class^='disabled ember-view']")[job_index].click()
        job_raw[job_index].click()
        time.sleep(random.random() * 3)

        ## then we reduce the page size in order to be able to see the right part of the page
        # and find the element with the name of the job ##
        driver.execute_script("document.body.style.zoom='67%'")
        time.sleep(random.random() * 3)

        # get the job name with the .text method
        job_name = driver.find_element(By.CSS_SELECTOR, "h2[class^='t-24 t-bold']").text
        time.sleep(random.random() * 3)

        # Couldn't retrieve the company name with the same method so created a workaround
        company_name = " ".join(driver.find_element(By.CSS_SELECTOR, "a[href^='/company']").get_attribute('href').split("/")[-3].split("-")).title()
        print(company_name)

        # get job description:
        job_details = driver.find_element(By.ID, "job-details").text

        # increase the page size:
        #driver.execute_script("document.body.style.zoom='100%'")

        # populate list:
        job_idx_list = [ref, job_name, company_name, job_details]
        time.sleep(random.random() * 3)

        page.send_keys(Keys.PAGE_DOWN)
        job_list.append(job_idx_list)
        print(f"Collected job: {job_name} for company: {company_name}")

    #Create dataframe:
    job_df = pd.DataFrame(job_list,
                                 columns = ["job_link", "position", "company name", "job description"]
                                ).drop_duplicates()


    # Save dataframe to an Excel file for later use (writing .xlsx requires the openpyxl package)
    job_df.to_excel(pathlib.Path().joinpath('scraped_jobs.xlsx'),
                           sheet_name='Jobs',
                           index= False)

    return job_df

In [47]:
# Here I input the number of jobs to reduce the collection time
page_scraper(5)

Wts Deutschland
Collected job: Senior Consultant/Manager (w/m/d) Business Intelligence for company: Wts Deutschland
Avantum Consult
Collected job: Senior Consultant (m/w/d) IBM Cognos Analytics/Cognos BI for company: Avantum Consult
Allucent Cro
Collected job: Senior Biostatistician for company: Allucent Cro
Collaboration Betters The World
Collected job: Business Intelligence Consultant (m/w/d) for company: Collaboration Betters The World
Collaboration Betters The World
Collected job: Business Intelligence Consultant (m/w/d) for company: Collaboration Betters The World


Unnamed: 0,job_link,position,company name,job description
0,https://www.linkedin.com/jobs/view/3584211731/...,Senior Consultant/Manager (w/m/d) Business Int...,Wts Deutschland,About the job\nIhre Aufgaben | Your tasks:\nDu...
1,https://www.linkedin.com/jobs/view/3495716962/...,Senior Consultant (m/w/d) IBM Cognos Analytics...,Avantum Consult,About the job\nInnerhalb der All for One Group...
2,https://www.linkedin.com/jobs/view/3723983608/...,Senior Biostatistician,Allucent Cro,About the job\nAllucent is a full-service cont...
3,https://www.linkedin.com/jobs/view/3722318579/...,Business Intelligence Consultant (m/w/d),Collaboration Betters The World,About the job\nPositive Thinking Company ist T...
4,https://www.linkedin.com/jobs/view/3717263213/...,Business Intelligence Consultant (m/w/d),Collaboration Betters The World,About the job\nPositive Thinking Company ist T...


If you work in a Jupyter notebook, you can put the magic command `%time` in front of a function call to check how long it takes to run:

In [49]:
%time page_scraper(2)

Wts Deutschland
Collected job: Senior Consultant/Manager (w/m/d) Business Intelligence for company: Wts Deutschland
Avantum Consult
Collected job: Senior Consultant (m/w/d) IBM Cognos Analytics/Cognos BI for company: Avantum Consult
CPU times: total: 78.1 ms
Wall time: 20 s


Unnamed: 0,job_link,position,company name,job description
0,https://www.linkedin.com/jobs/view/3584211731/...,Senior Consultant/Manager (w/m/d) Business Int...,Wts Deutschland,About the job\nIhre Aufgaben | Your tasks:\nDu...
1,https://www.linkedin.com/jobs/view/3495716962/...,Senior Consultant (m/w/d) IBM Cognos Analytics...,Avantum Consult,About the job\nInnerhalb der All for One Group...
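
The gap between the two numbers above (milliseconds of CPU time versus 20 seconds of wall time) tells you the scraper spends nearly all of its time waiting, not computing. Outside a notebook you can measure both with the standard library. In this sketch, `measure` is a hypothetical helper and `time.sleep` stands in for a scraper that mostly waits on the browser.

```python
import time


def measure(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# time.sleep waits without using the CPU, much like waiting for a page to load
_, wall, cpu = measure(time.sleep, 0.2)
print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s")
```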


In [50]:
driver.close() # closes the current browser window (use driver.quit() to end the whole session)


## <a id='toc2_3_'></a>[Extra: Save cookies in a pickle 🥒](#toc0_)

In [None]:
# Save cookies in a pickle file
# (run this while the browser session is still open, i.e. before driver.close())
import pickle

# Create the folder if it doesn't exist yet
cookies_dir = 'saved_cookies'
os.makedirs(cookies_dir, exist_ok=True)
# os.removedirs(cookies_dir) --> to remove a directory

save_location = join(cookies_dir, 'cookies.pkl')
with open(save_location, "wb") as file:
    pickle.dump(driver.get_cookies(), file)

In [None]:
# Load cookies (the browser must already be on the site the cookies belong to)
with open(save_location, "rb") as file:
    cookies = pickle.load(file)
for cookie in cookies:
    driver.add_cookie(cookie)
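
A word of caution: loading a pickle file can run arbitrary code, so only open pickles you created yourself. Since cookies are a list of plain dictionaries, JSON is a possible alternative. The sketch below assumes every cookie value is JSON-serializable (strings, numbers, booleans). The helper names and the sample cookie are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to a JSON file, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cookies, indent=2), encoding="utf-8")
    return path


def load_cookies(path):
    """Read the list of cookie dicts back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up cookie for illustration; in the notebook you would pass driver.get_cookies()
sample = [{"name": "session", "value": "abc123", "domain": ".example.com", "secure": True}]

# A temporary folder keeps the demo from leaving files behind
with tempfile.TemporaryDirectory() as folder:
    location = save_cookies(sample, Path(folder) / "saved_cookies" / "cookies.json")
    restored = load_cookies(location)

print(restored == sample)
```

The restored list can then be passed to `driver.add_cookie` one cookie at a time, exactly as in the cell above. A nice side effect of JSON is that you can open the file in a text editor and see what was saved.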

# <a id='toc3_'></a>[References/Acknowledgments](#toc0_)

Thanks to Goncalo Jardim for the main class structure and code to scrape LinkedIn job posts.