# **LinkedIn Jobs Scraping**
In this project, I will scrape job posting data from a job posting site and analyze it. The goal is to be better prepared to land my first job as an intern in the data world.

I want to see where the internship jobs are located and what they require. Sadly, it is not common in Europe to publish salary data, but I will also check the salaries for my (hopefully) first junior position.

To achieve this, I will set up the environment, identify the job posting site, scrape the data, and then process, analyze, and visualize it.

Let's do it!

### Step 1: Understanding the LinkedIn Jobs Search Page

A quick inspection of the search page reveals a few inconveniences:
- The jobs are listed as individual cards with little information: company name, job title, location, and a few extras such as the job id and the link to the job post. So not everything is on one page.
- It has **lazy load**. LinkedIn only loads the first 35 jobs at first. You need to scroll to the bottom of the page to load the next 25 jobs. 
- After lazy loading 5 times, a **'see more jobs'** button appears.

Let's first solve the last two problems. **How can I load all the jobs in a search?**

### Step 2: Scroll down and click with Selenium

**Selenium** is a Python library that automates browsers. Basically, it will open a Chrome window, scroll down, and click the 'see more jobs' button for us.

In [4]:
#Import Packages
from selenium import webdriver
import time
import pandas as pd
import os

from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

In [23]:
#This code below opens a chrome window in the referenced url
#Setting webdriver 
service = Service(executable_path=r'D:\DATA\Projects\Linkedin-Jobs-Scrapping\chromedriver-win64\chromedriver.exe') #raw string, so single backslashes are enough
driver = webdriver.Chrome(service=service)
#The url we want to open. Later in the process we will make this url dynamic. 
url1 = 'https://www.linkedin.com/jobs/search/?keywords=Data&location=Spain&geoId=105646813&f_TPR=&f_E=1&position=1'
driver.get(url1) #open the job search page
driver.implicitly_wait(10) #for safety, let element lookups wait up to 10 seconds for the page to load before failing
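
The comment above mentions making the URL dynamic later. As a rough sketch of one way to do that, the query string can be built from a dictionary with the standard library's `urlencode`. The helper name `build_search_url` and its arguments are hypothetical; the parameter names are simply copied from `url1`, and I am not assuming anything about what each one means to LinkedIn.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.linkedin.com/jobs/search/"


def build_search_url(keywords, location, geo_id, experience="1"):
    """Build a search URL with the same query parameters as url1 (hypothetical helper)."""
    params = {
        "keywords": keywords,
        "location": location,
        "geoId": geo_id,
        "f_TPR": "",        # left empty, exactly as in url1
        "f_E": experience,  # copied from url1
        "position": 1,
    }
    # urlencode escapes spaces and special characters in the values
    return BASE_URL + "?" + urlencode(params)


# Rebuilds the URL used above from its parts
print(build_search_url("Data", "Spain", "105646813"))
```

With this, a different search only needs different arguments, for example `build_search_url("Data Analyst", "Spain", "105646813")`.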

In [24]:
#Find number of jobs in the search

n = driver.find_element(By.CLASS_NAME, 'results-context-header__job-count').text
#Because LinkedIn can write the count as '+17,700', we strip the plus sign (on either side) and the comma before converting to an int.
n = n.strip("+")
n = int(n.replace(",", ""))
y = pd.to_numeric(n)
y

612
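
The cleanup in the cell above can also be wrapped in a small function, which makes it easier to check against different label formats. This is only a sketch: `parse_job_count` is a hypothetical helper, and the sample labels are assumptions about how the count might be written, not values taken from the page.

```python
import re


def parse_job_count(text):
    """Return the digits of a count label such as '612' or '+17,700' as an int."""
    # Drop everything that is not a digit: plus signs, commas, spaces
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_job_count("612"))
print(parse_job_count("+17,700"))
```

In the notebook it would be used as `y = parse_job_count(driver.find_element(By.CLASS_NAME, 'results-context-header__job-count').text)`.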

In [25]:
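#Scrolling down and loading all jobs
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException, ElementClickInterceptedException

i = 0
while i <= int((y+35+25)/25):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    i = i + 1
    print('number of jobs:', 35+i*25) #estimate: 35 on the first load plus 25 per scroll, not a count of loaded cards

    try:
        #Click the 'see more jobs' button once it shows up
        driver.find_element(By.XPATH,'//*[@id="main-content"]/section[2]/button').click()
        time.sleep(1)

    except (NoSuchElementException, ElementNotInteractableException, ElementClickInterceptedException):
        #No clickable button yet: give the lazy load time to finish
        time.sleep(2)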
         

number of jobs: 60
number of jobs: 85
number of jobs: 110
number of jobs: 135
number of jobs: 160
number of jobs: 185
number of jobs: 210
number of jobs: 235
number of jobs: 260
number of jobs: 285
number of jobs: 310
number of jobs: 335
number of jobs: 360
number of jobs: 385
number of jobs: 410
number of jobs: 435
number of jobs: 460
number of jobs: 485
number of jobs: 510
number of jobs: 535
number of jobs: 560
number of jobs: 585
number of jobs: 610
number of jobs: 635
number of jobs: 660
number of jobs: 685
number of jobs: 710
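
Note that the numbers printed above are an estimate (35 plus 25 per scroll), not a count of the cards actually on the page. An alternative is to keep scrolling until the number of loaded cards stops growing. The sketch below is not what the notebook runs: `load_until_stable` and its parameters are hypothetical, and the three callables are supplied by the caller so the stopping logic can be tried without a browser.

```python
import time


def load_until_stable(scroll, click_more, count_cards,
                      max_rounds=100, patience=3, pause=2.0):
    """Scroll until the number of loaded cards stops growing and return that number.

    scroll, click_more, and count_cards are caller-supplied callables;
    click_more is expected to do nothing when the button is missing.
    """
    last = count_cards()
    stalled = 0
    for _ in range(max_rounds):
        scroll()
        click_more()
        time.sleep(pause)  # give the lazy load time to finish
        current = count_cards()
        if current > last:
            last, stalled = current, 0
        else:
            stalled += 1
            if stalled >= patience:  # several rounds without new cards: stop
                break
    return last


# Possible wiring with the Selenium driver from the cells above (assumed, untested here):
# scroll = lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# count_cards = lambda: len(driver.find_elements(By.CLASS_NAME, 'base-search-card__subtitle'))
```

The `patience` argument is there because a single slow load should not end the loop too early.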


### Step 3: Extracting the data from the job list

In [10]:
#Let's initiate some lists
CompanyName = []
JobTitle= []

In [26]:
#testing the code
driver.find_elements(By.CLASS_NAME, 'base-search-card__subtitle')[116].text

'Emerald Stay® (A certified B Corp)'

In [27]:
#Loop to go through all jobs and extract their data
for job in range(y):
    company = driver.find_elements(By.CLASS_NAME, 'base-search-card__subtitle')[job].text
    CompanyName.append(company)

CompanyName

IndexError: list index out of range

Here, I ran into an `IndexError: list index out of range` every time.

I looked more closely at how LinkedIn loads its jobs and how the page looks. After several attempts and searches, I figured out that LinkedIn never shows the full list of available jobs.

If the job count of your search is approximate (+17,000, for example), the page stops loading more jobs after several clicks on the button.

If the count is exact, you can reach the 'You've viewed all jobs for this search' message. But it is a bit misleading. If you inspect the page source, the last card is even numbered to match the total of your search. That is, if you had 617 jobs, you will see a div with the attribute `data-row="610"`. But the numbering skips values along the way. So for a search of 600 jobs, the page usually shows about half of them: 300.
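
Since the page holds fewer cards than the advertised count, one way to avoid the `IndexError` is to fetch the cards once and loop over whatever is actually there, instead of indexing up to `y`. A minimal sketch, where `card_texts` is a hypothetical helper that works on any objects with a `.text` attribute (such as Selenium elements):

```python
def card_texts(elements):
    """Return the stripped text of every element that was actually loaded."""
    return [element.text.strip() for element in elements]


# Possible use with the driver from the cells above (assumed, untested here):
# cards = driver.find_elements(By.CLASS_NAME, 'base-search-card__subtitle')
# CompanyName = card_texts(cards)
# print(len(CompanyName), 'cards loaded out of', y, 'advertised')
```

Calling `find_elements` once also avoids searching the whole page again on every iteration.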

This could be good enough. But we are learning here. So I decided to try a different method. 

### Step 2b: Using Scrapy