# SG Indeed web scraper
## Scrapes [sg.indeed.com](https://sg.indeed.com/) for job postings from Singapore
<img src='https://d34k7i5akwhqbd.cloudfront.net/allspark/static/images/indeed-share-image-9581a8.png' alt='Indeed Logo' width="400">

### Overview
Publicly available datasets of job postings are scarce in the Singapore context.

This tool scrapes raw data from [sg.indeed.com](https://sg.indeed.com/). For each job posting, it retrieves:

*   Job title
*   Job description
*   URL to job posting

### Datasets
The following data sets have been generated with this tool:

### Start scraping web pages

In [54]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import random

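def scrape(job_title, count):
    """Return a list of (url, soup) pairs for postings matching job_title."""
    job_postings = []
    job_title = job_title.replace(' ', '+')

    # Each results page is counted as 15 postings; `start` advances by 10 per page.
    for i in range(count // 15):
        page_url = "https://sg.indeed.com/jobs?q={}&start={}".format(job_title, i*10)
        page = requests.get(page_url)

        soup = BeautifulSoup(page.content, "lxml")
        jobs = soup.find("ul", {"class": "jobsearch-ResultsList"}).findAll('li', recursive=False)
        # Keep only list items that contain a job title heading, so non-job
        # entries are skipped without relying on fixed positions.
        jobs = [job for job in jobs if job.find("h2", {"class": "jobTitle"})]

        for job in jobs:
            # Follow the link in the job card and parse the full posting page.
            title = job.find("h2", {"class": "jobTitle"}).a
            url = "https://sg.indeed.com"+title["href"]
            job_soup = BeautifulSoup(requests.get(url).content, "lxml")
            job_postings.append((url, job_soup))
            sleep(random()*2+1)  # pause 1-3 seconds between requests

        print(f"Found {len(job_postings)}/{count} job postings.")

    return job_postings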
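
To preview which result pages `scrape` will request before running it, the URL construction can be sketched on its own. This is a minimal sketch, not part of the original notebook: `build_search_urls`, `BASE_URL`, `per_page` and `step` are hypothetical names, and the values 15 and 10 simply mirror the `count // 15` and `i*10` used in `scrape`. It uses `urllib.parse.urlencode`, which also escapes characters such as `+` that a plain `replace(' ', '+')` leaves untouched (for example in "C++ developer").

```python
from urllib.parse import urlencode

BASE_URL = "https://sg.indeed.com/jobs"  # same endpoint that scrape() requests


def build_search_urls(job_title, count, per_page=15, step=10):
    """Return the result-page URLs needed to cover roughly `count` postings."""
    pages = count // per_page
    return [
        f"{BASE_URL}?{urlencode({'q': job_title, 'start': i * step})}"
        for i in range(pages)
    ]


urls = build_search_urls("software engineer", 45)
for url in urls:
    print(url)
# https://sg.indeed.com/jobs?q=software+engineer&start=0
# https://sg.indeed.com/jobs?q=software+engineer&start=10
# https://sg.indeed.com/jobs?q=software+engineer&start=20
```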

Run the following cell to start scraping.

The function takes two arguments:

*   `job_title` (str): the search term used to find job postings
*   `count` (int): the number of postings to scrape


In [58]:
raw_text = scrape("software engineer", 600)

Found 15/600 job postings.
Found 30/600 job postings.
Found 45/600 job postings.
Found 60/600 job postings.
Found 75/600 job postings.
Found 90/600 job postings.
Found 105/600 job postings.
Found 120/600 job postings.
Found 135/600 job postings.
Found 150/600 job postings.
Found 165/600 job postings.
Found 180/600 job postings.
Found 195/600 job postings.
Found 210/600 job postings.
Found 225/600 job postings.
Found 240/600 job postings.
Found 255/600 job postings.
Found 270/600 job postings.
Found 285/600 job postings.
Found 300/600 job postings.
Found 315/600 job postings.
Found 330/600 job postings.
Found 345/600 job postings.
Found 360/600 job postings.
Found 375/600 job postings.
Found 390/600 job postings.
Found 405/600 job postings.
Found 420/600 job postings.
Found 435/600 job postings.
Found 450/600 job postings.
Found 465/600 job postings.
Found 480/600 job postings.
Found 495/600 job postings.
Found 510/600 job postings.
Found 525/600 job postings.
Found 540/600 job postings.
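
The scraping loop advances the `start` parameter by 10 while counting 15 postings per page, so the same posting may be collected more than once if consecutive result pages overlap. Whether that happens depends on how the site paginates, so treat the following as an optional clean-up step. It is a minimal sketch: `dedupe_by_url` is a hypothetical helper, and the sample URLs are placeholders rather than real postings.

```python
def dedupe_by_url(job_postings):
    """Keep the first (url, soup) pair seen for each URL, preserving order."""
    seen = set()
    unique = []
    for url, soup in job_postings:
        if url not in seen:
            seen.add(url)
            unique.append((url, soup))
    return unique


# Placeholder data for illustration only; real entries come from scrape().
sample = [
    ("https://sg.indeed.com/a", "page A"),
    ("https://sg.indeed.com/b", "page B"),
    ("https://sg.indeed.com/a", "page A again"),
]
print(len(dedupe_by_url(sample)))  # prints 2
```

In the notebook this would be applied as `raw_text = dedupe_by_url(raw_text)` before extracting data.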

### Extract data from pages

In [59]:
import pandas as pd

raw_data = {"URL": [], "Job Title": [], "Job Description": []}
for url, job in raw_text:
    title = job.find("h1", {"class": "jobsearch-JobInfoHeader-title"}).get_text()
    description = job.find("div", {"class": "jobsearch-JobComponent-description"}).get_text()

    raw_data["URL"].append(url)
    raw_data["Job Title"].append(title)
    raw_data["Job Description"].append(description)

raw_data = pd.DataFrame(raw_data)
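
The extraction loop calls `.get_text()` directly on the result of `find`. Beautiful Soup's `find` returns `None` when no matching element exists, so a single posting page with a different layout would raise `AttributeError` and stop the loop. One way to guard against this is a small wrapper. This is a sketch under assumptions: `text_or_none` is a hypothetical helper, and collapsing whitespace is an optional choice that keeps each description on one line in the CSV.

```python
def text_or_none(node):
    """Return the element's text with whitespace collapsed, or None if the
    element was not found (that is, find() returned None)."""
    if node is None:
        return None
    return " ".join(node.get_text().split())


# Intended use inside the extraction loop (not executed here):
#   title = text_or_none(job.find("h1", {"class": "jobsearch-JobInfoHeader-title"}))
#   description = text_or_none(job.find("div", {"class": "jobsearch-JobComponent-description"}))

print(text_or_none(None))  # prints None
```

Rows with a `None` title or description can then be dropped from the DataFrame with `raw_data.dropna()`.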

### Export to Google Drive

In [60]:
from google.colab import drive
drive.mount('/content/drive')

FILENAME = 'datascientist.csv' # edit this line to rename file
raw_data.to_csv(f'/content/drive/My Drive/{FILENAME}', index=False)

drive.flush_and_unmount()

Mounted at /content/drive
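
The `google.colab` module is only available inside Google Colab. When running elsewhere, the same three columns can be written to a local file instead. The following is a minimal sketch using only the standard library: `export_csv`, `FIELDS` and the output file name are hypothetical, and the single row is placeholder data rather than a scraped posting.

```python
import csv
import os
import tempfile

FIELDS = ["URL", "Job Title", "Job Description"]  # same columns as raw_data


def export_csv(rows, path):
    """Write a list of dicts keyed by FIELDS to `path` as a UTF-8 CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder row for illustration only.
rows = [
    {
        "URL": "https://sg.indeed.com/example",
        "Job Title": "Example title",
        "Job Description": "Example description, with a comma",
    }
]
out_path = os.path.join(tempfile.gettempdir(), "jobs_example.csv")
export_csv(rows, out_path)
print(f"Wrote {len(rows)} row(s) to {out_path}")
```

With pandas already loaded, `raw_data.to_csv(path, index=False)` with a local `path` achieves the same result.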
