## Job Posting Scraper

Web scraping for job postings has become indispensable for anyone who wants to automate the collection of data from job board websites.

The primary objective of a job scraper is to collect details such as job titles, descriptions, company names, locations, and occasionally salary data from the listings on a job board. The code below targets Shine.com; an Indeed.com URL is kept as a commented-out alternative.

In [1]:
# import Libraries

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


In [14]:
# Starting from a single page
#URL = "https://ng.indeed.com/jobs?q=data+scientist&l=Nigeria&from=searchOnHP&vjk=3fae9bce33111474"

URL = "https://www.shine.com/job-search/data-jobs-in-uk?q=data&loc=UK"
#conducting a request of the stated URL above:
page = requests.get(URL)

#specifying the desired format of "page" using the html parser - this allows python to read the various components of the page, rather than treating it as one long string.
soup = BeautifulSoup(page.text, "html.parser")

#printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="https://staticcand.shine.com/" rel="preconnect"/>
  <link href="https://analytics.htmedia.in/" rel="preconnect"/>
  <link href="https://connect.facebook.net/" rel="dns-prefetch"/>
  <link href="https://www.googletagmanager.com/" rel="dns-prefetch"/>
  <link href="https://www.google-analytics.com/" rel="dns-prefetch"/>
  <link href="https://www.staticlearn.shine.com/" rel="preconnect"/>
  <link href="https://www.staticrect.shine.com/" rel="preconnect"/>
  <meta charset="utf-8"/>
  <script async="" src="https://cdn.debugbear.com/HVNoyGDk42Eg.js">
  </script>
  <title>
   Data Jobs in UK (Mar 2024) - 87 Data Openings in UK - Shine.com, Apply Now!
  </title>
  <meta content="Checkout latest 87 Data Jobs in UK. Apply Now for Data Jobs Openings in UK.&amp;#10003; Top Jobs* &amp;#10003; Free Alerts on Shine.com, Apply Now!" name="description"/>
  <meta content="Data Jobs in UK (Mar 2024) - 87 Data Openings in UK - Shine.com, Apply Now!" p

In [15]:
print('The response that we got back from the URL is', page.status_code)

The response that we got back from the URL is 200


Since the response status code is 200, the request succeeded and we can proceed.

#### Extracting the five key points for every job posting

In [16]:
# investigate HTML

html = soup.find_all('div')
html

[<div id="__next"><header class="mHeader" id="mHeaderId"><div class="slideMenuWrap"><nav class="slideMenu"><div class="slideMenu__header"><div class="slideMenu__header--info"><strong>Welcome!</strong><div class="d-flex mt-10"><a class="btn btn-outline-secondary mr-15" href="/myshine/login/">Sign in</a><a class="btn btn-outline-secondary" href="/registration/">Register</a></div></div></div><ul class="tabs"></ul><ul class="site-links"><li><a href="/blog?utm_source=www.shine.com&amp;amp;utm_medium=searchdefault&amp;amp;utm_campaign=msite_links">Blog</a></li><li><a href="/securityadvice">Fraud Alert</a></li><li><a href="/aboutus">About Us</a></li><li><a href="/contactus" target="_self">Contact Us</a></li><li><a href="/faqs">FAQs</a></li><li><a class="" data-for="andriod" data-type="leftPanel" href="" id="id_downloadapp">Download App</a></li><li><a href="/contactus?type=reportJobPosting">Report a Job Posting</a></li></ul><ul class="other-links"><li><span class="other-links--left">Employers<

The result is a list of every div element on the page, which we can inspect to find the tags and classes that hold the job data.

In [17]:
# Getting the Job Title

req = soup.select('div h2[itemprop="name"]')
#fetching the text from the html
titles = [r.text for r in req]
#Removing double spaces (single spaces between words are kept)
titles = [t.replace("  ", "") for t in titles]
titles[:5]

['Data OperatorIn LONDON',
 'Urgent Hiring for Data Analyst',
 'Computer Operator Data Entry Operator',
 'Payroll Administrator',
 'Data OperatorIn UNITED KINGDOM']
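Some of the titles above have words fused together ("OperatorIn"). One plausible cause is that the raw text separated those words with a double space, which replace("  ", "") then deleted outright. A minimal sketch of a gentler cleanup under that assumption follows; normalize_title is a hypothetical helper, and the sample strings are made up to resemble what r.text might return.

```python
import re

def normalize_title(raw: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()

# Hypothetical raw strings (assumed, not taken from the real page)
raw_titles = ["Data Operator  In LONDON", "  Payroll   Administrator "]
print([normalize_title(t) for t in raw_titles])
```

Collapsing whitespace instead of deleting it keeps word boundaries intact while still removing padding.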

In [19]:
# Find Employer Name / Company

orgs = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
#fetching the text from the HTML
orgs1 = [o.text for o in orgs]
sub_string ='Hiring'
#Splitting the string on a sub string and getting the first index (Cleaning up names)
orgs1 = [o.split(sub_string)[0] for o in orgs1]
#Removing any spaces
orgs1 = [o.strip() for o in orgs1]

orgs1[:5]

['Adal immigrations LLP',
 'Prime Immigration LLP',
 'NATIONAL SEEDS CORPORATION LIMITED',
 'MILLENNIUM BABYCARES PRIVATE LIMITE...',
 'Adal immigrations LLP']

In [21]:
# Get Job Location

#fetching the HTML data from the class where the location data is available
loc = soup.find_all('div', class_='jobCard_jobCard_lists__fdnsc')
#fetching all the text from the HTML
location = [l.text for l in loc]
#cleaning the locations (Getting everything after the Yr(s))
location = [re.findall("Yrs?(.*)$", i)[0] for i in location]
#Replacing the "+4" extra-location counter with a separator (other counters such as "+8" are left untouched)
location = [l.replace("+4", ", ") for l in location]

location[:5]

['United Kingdom',
 'United Kingdom+8Australia, United Arab Emirates, Canada, Malaysia, Singapore, Malta, Germany, Hong Kong',
 'United Kingdom+13Australia, Canada, United Arab Emirates, United States Of America, Gurugram, Mumbai City, Bangalore, Noida, Chennai, Hyderabad, Kolkata, Pune, Delhi',
 'United Kingdom+12Canada, United Arab Emirates, Gurugram, United States Of America, Mumbai City, Bangalore, Noida, Chennai, Hyderabad, Kolkata, Pune, Delhi',
 'United Kingdom']
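The output above still contains counters such as "+8" and "+13", because only the literal "+4" is replaced. A regular expression can handle any counter. The sketch below assumes the counter is always a plus sign followed by digits; clean_location is a hypothetical helper name, and the sample strings are shortened versions of the values shown above.

```python
import re

def clean_location(raw: str) -> str:
    """Replace a '+N' counter (for example '+8' or '+13') with a comma separator."""
    return re.sub(r"\+\d+", ", ", raw).strip()

samples = [
    "United Kingdom",
    "United Kingdom+8Australia, United Arab Emirates",
]
print([clean_location(s) for s in samples])
```

Strings without a counter pass through unchanged, so the helper can be applied to every row.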

In [23]:
# Get Experience Years

#fetching the text from the loc variable for the experience
exp = [l.text for l in loc]
#Cleaning up using regex
experience = [re.findall("^(.*) Yrs?", i)[0] for i in exp]

experience[:5]

['2 to 7', '5 to 10', '5 to 10', '4 to 9', '2 to 7']
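The experience values are still strings, which makes filtering or sorting awkward. As a sketch, the range can be split into numeric bounds. parse_experience is a hypothetical helper, and it assumes every value holds one or two integers, possibly with extra symbols as in "15 to >25".

```python
import re
from typing import Optional, Tuple

def parse_experience(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (minimum, maximum) years found in a string such as '2 to 7'."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return (None, None)  # nothing numeric to parse
    return (numbers[0], numbers[-1])

for value in ["2 to 7", "15 to >25"]:
    print(value, parse_experience(value))
```

A value with a single number yields the same figure for both bounds.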

In [24]:
# Get Number of Open Positions

vac = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')
#fetching the text from the HTML
vac = [v.text for v in vac ]

#Taking the first number in each entry, defaulting to 1 when no number is found
vacancies = [int(re.findall(r'\d+', text)[0]) if re.findall(r'\d+', text) else 1 for text in vac]

In [25]:
vacancies[:5]

[70, 1, 99, 99, 60]
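The list comprehension above runs the same regular expression twice per entry. A small helper does the same work with one search and makes the default explicit. This is only a sketch: first_number is a hypothetical name, and the sample strings are invented, since the raw text of the job detail list is not shown here.

```python
import re

def first_number(text: str, default: int = 1) -> int:
    """Return the first integer in text, or default when there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default

# Invented sample strings, not the site's real markup
samples = ["Positions: 70", "Regular", "99 openings, 2 shifts"]
print([first_number(s) for s in samples])
```

It could replace the comprehension as vacancies = [first_number(text) for text in vac].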

In [26]:
# Putting the data together into a Dataframe

data = {'Titles':titles, 'Firm Name': orgs1, 
        'Job Location':location, 'Experience':experience,
        'Positions': vacancies}

Job_df = pd.DataFrame(data)

In [29]:
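# Check for duplicates

#Counting repeated titles (the count is not displayed, because it is not the last expression in the cell)
Job_df['Titles'].duplicated().sum()

#Dropping rows that repeat an earlier title
Job_df.drop_duplicates(subset='Titles', inplace=True)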
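Dropping duplicates on the title alone also removes postings from different firms that happen to share a title. The sketch below illustrates the difference with plain Python dictionaries so that it runs without pandas; drop_duplicate_rows and the sample rows are hypothetical. In pandas, the equivalent would be passing a list of columns, for example Job_df.drop_duplicates(subset=['Titles', 'Firm Name']).

```python
def drop_duplicate_rows(rows, keys):
    """Keep the first row for each distinct combination of the given keys."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(row[k] for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

# Made-up rows: the same title posted by two different firms
rows = [
    {"Titles": "Data Scientist", "Firm Name": "Firm A"},
    {"Titles": "Data Scientist", "Firm Name": "Firm B"},
    {"Titles": "Data Scientist", "Firm Name": "Firm A"},
]
by_title = drop_duplicate_rows(rows, ["Titles"])
by_title_and_firm = drop_duplicate_rows(rows, ["Titles", "Firm Name"])
print(len(by_title), len(by_title_and_firm))
```

Matching on title and firm together keeps the Firm B posting that a title-only match would discard.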

In [30]:
# check DataFrame
Job_df.head()

| | Titles | Firm Name | Job Location | Experience | Positions |
|---|---|---|---|---|---|
| 0 | Data OperatorIn LONDON | Adal immigrations LLP | United Kingdom | 2 to 7 | 70 |
| 1 | Urgent Hiring for Data Analyst | Prime Immigration LLP | United Kingdom+8Australia, United Arab Emirate... | 5 to 10 | 1 |
| 2 | Computer Operator Data Entry Operator | NATIONAL SEEDS CORPORATION LIMITED | United Kingdom+13Australia, Canada, United Ara... | 5 to 10 | 99 |
| 3 | Payroll Administrator | MILLENNIUM BABYCARES PRIVATE LIMITE... | United Kingdom+12Canada, United Arab Emirates,... | 4 to 9 | 99 |
| 4 | Data OperatorIn UNITED KINGDOM | Adal immigrations LLP | United Kingdom | 2 to 7 | 60 |


In [31]:
Job_df.shape

(17, 5)

The code above extracts a single page only. Because the job postings span more than one page, we write a loop to cover the remaining pages.

In [64]:
TITLES = []
COMPANIES = []
LOCATIONS = []
EXPERIENCE = []
VACANCIES = []

Range = range(1,6)
for i in Range:
    #NOTE: the page number i is never inserted into the URL, so every iteration requests the same page
    link = "https://www.shine.com/job-search/data-jobs-in-uk?q=data&loc=UK"
    response = requests.get(link)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        #Extract Job Title
        req = soup.select('div h2[itemprop="name"]')
        titles = [r.text for r in req]
        titles1 = [t.replace("|","") for t in titles]
        titles = [t.replace("  ", "") for t in titles1]
        TITLES.extend(titles)
        
        #Extract Company Name
        orgs = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
        orgs1 = [o.text for o in orgs]
        sub_str = "Hiring"
        companies = [o.split(sub_str)[0] for o in orgs1]
        COMPANIES.extend(companies) 
        
        #Extract Job Location
        loc = soup.find_all('div', class_='jobCard_jobCard_lists__fdnsc')
        location = [l.text for l in loc]
        location = [re.findall("Yrs?(.*)$", i)[0] for i in location]
        location = [l.replace("+4", ", ") for l in location]
        LOCATIONS.extend(location)
        
        #Extract Experience
        exp = [l.text for l in loc]
        experience = [re.findall("^(.*) Yrs?", i)[0] for i in exp]
        EXPERIENCE.extend(experience)  
        
        #Extract Positions
        vacancies = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')
        vac = [v.text for v in vacancies]
        vacancies = [int(re.findall(r'\d+', text)[0]) if re.findall(r'\d+', text) else 1 for text in vac]
        VACANCIES.extend(vacancies)
        
    else:
        print('Invalid Response')

df = pd.DataFrame({'Job Title': TITLES, 
                   'Employer': COMPANIES,
                   'Job Location': LOCATIONS, 
                   'Experience': EXPERIENCE, 
                   'Positions': VACANCIES})

print(f'We have managed to fetch {len(df)} job postings while scraping {len(Range)} pages.')

We have managed to fetch 100 job postings while scraping 5 pages.
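Because the loop requests one URL five times, the 100 rows are most likely the same listings repeated. A minimal sketch of a loop that skips URLs it has already visited and pauses between requests is shown below. The names scrape_pages, fetch, and parse are hypothetical, and the fetcher is passed in as a function so that the example runs without network access; in the notebook, it would wrap requests.get and return None for a non-200 response.

```python
import time
from typing import Callable, Iterable, List, Optional

def scrape_pages(urls: Iterable[str],
                 fetch: Callable[[str], Optional[str]],
                 parse: Callable[[str], List[str]],
                 delay_seconds: float = 1.0) -> List[str]:
    """Fetch each distinct URL once, pausing between requests, and collect parsed items."""
    results: List[str] = []
    visited = set()
    for url in urls:
        if url in visited:
            continue  # never request the same page twice
        if visited:
            time.sleep(delay_seconds)  # pause before every request after the first
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue  # treat None as a failed request
        results.extend(parse(html))
    return results

# Stand-in data: fake pages keyed by made-up example.com URLs
fake_pages = {
    "https://example.com/jobs?page=1": "Job A|Job B",
    "https://example.com/jobs?page=2": "Job C",
}
urls = [
    "https://example.com/jobs?page=1",
    "https://example.com/jobs?page=1",  # repeated on purpose
    "https://example.com/jobs?page=2",
    "https://example.com/jobs?page=3",  # missing page, fetch returns None
]
jobs = scrape_pages(urls, fake_pages.get, lambda html: html.split("|"), delay_seconds=0)
print(jobs)
```

Injecting the fetcher also makes the loop easy to test against saved HTML before pointing it at a live site.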


In [65]:
df.head(10)

| | Job Title | Employer | Job Location | Experience | Positions |
|---|---|---|---|---|---|
| 0 | Data OperatorIn LONDON | Adal immigrations LLP | United Kingdom | 2 to 7 | 70 |
| 1 | Urgent Hiring for Data Analyst | Prime Immigration LLP | United Kingdom+8Australia, United Arab Emirate... | 5 to 10 | 1 |
| 2 | Computer Operator Data Entry Operator | NATIONAL SEEDS CORPORATION LIMITED | United Kingdom+13Australia, Canada, United Ara... | 5 to 10 | 99 |
| 3 | Payroll Administrator | MILLENNIUM BABYCARES PRIVATE LIMITE... | United Kingdom+12Canada, United Arab Emirates,... | 4 to 9 | 99 |
| 4 | Data OperatorIn UNITED KINGDOM | Adal immigrations LLP | United Kingdom | 2 to 7 | 60 |
| 5 | Geologist to analyze | MILLENNIUM BABYCARES PRIVATE LIMITE... | United Kingdom+12Canada, United Arab Emirates,... | 5 to 10 | 99 |
| 6 | Sourcing Manager | Garima Interprises | United Kingdom+2Australia, United States Of Am... | 10 to 15 | 29 |
| 7 | Data Management | Future Solution Centre | United Kingdom+18China, Canada, Qatar, Kuwait,... | 15 to >25 | 99 |
| 8 | Production EngineerIn United Kingdom | Adal immigrations LLP | United Kingdom | 2 to 7 | 70 |
| 9 | Data Scientist | Vijay Deonath Kuwar | United Kingdom | 1 to 6 | 17 |


### Export into a CSV File

In [66]:
# Define the file path where you want to save the CSV file
file_path = "Shine Job Posting.csv"

# Export the DataFrame to a CSV file
df.to_csv(file_path, index=False)

print("DataFrame successfully exported to CSV file:", file_path)

DataFrame successfully exported to CSV file: Shine Job Posting.csv


### Summary
1. Performing HTTP Requests: Utilizing the requests library, we initiate HTTP GET requests directed at the specified URLs. Before proceeding, we validate the status code of the response to ensure a successful outcome (HTTP status code 200).

2. Parsing HTML Content: Following the successful retrieval of a webpage, we employ the BeautifulSoup library to parse through its HTML content. This facilitates navigation through the structure of the page, enabling us to extract pertinent information.

3. Data Extraction: We navigate the HTML structure of a website to extract targeted data, often stored within various classes. Our objective typically involves retrieving job titles, company names, locations, and experience requirements.

4. Iterating Through Pages: After working out the extraction for a single page, we extend the approach to multiple pages, each of which holds a list of job postings. We iterate over a range of page numbers, request and parse each page, and append the results to shared lists. Reaching distinct pages requires a unique URL per iteration that incorporates the page number alongside the query parameters; the loop above reuses a single URL, so that step still needs to be added.
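
Building one URL per page, as step 4 describes, could be sketched as follows. This is an illustration only: build_page_url is a hypothetical helper, and the "page" query parameter is an assumption, because the pagination scheme Shine.com actually uses is not shown in this notebook and should be checked in the browser first.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with the query parameter `param` set to the page number."""
    parts = urlsplit(base_url)
    # Keep the existing query parameters, dropping any earlier page value
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Made-up base URL; "page" is an assumed parameter name
base = "https://example.com/jobs?q=data&loc=UK"
page_urls = [build_page_url(base, n) for n in range(1, 4)]
print(page_urls)
```

If a site encodes the page number in the path instead of the query string, the same idea applies with string formatting on the path.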