## Jobs scraping

This notebook scrapes job offers from elempleo.com.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import wordcloud
from selenium import webdriver
import time

# Url of the main webpage
url = "http://www.elempleo.com/co/ofertas-empleo/"

# Number of pages to be scraped
nopag = 20

# Initialize browser in page
browser = webdriver.Chrome()
browser.get(url)

# Collect the HTML of each results page, clicking the "next" button to move to the following one
pages = []
for _ in range(nopag):
    
    # Save current page content
    html = browser.page_source
    pages.append(BeautifulSoup(html, 'html.parser'))
    
    # Find and click "next" button
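    # find_element_by_xpath was removed in Selenium 4.3; find_element("xpath", ...) is the current form
    button = browser.find_element("xpath", "/html/body/div[8]/div[4]/div[1]/div[4]/div/nav/ul/li[8]/a")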
    button.click()
    
    # Wait for content to load
    time.sleep(3)
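
The fixed `time.sleep(3)` above waits the same amount whether the page loads in half a second or in ten. A minimal sketch of a polling alternative follows; the helper name `wait_until` and its parameters are hypothetical and not part of the original notebook. Selenium also ships its own explicit-wait mechanism (`WebDriverWait`), which may be preferable in practice.

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the timeout expired.
    (Hypothetical helper, shown only as an illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Inside the loop, something like `wait_until(lambda: browser.page_source != html)` could replace the sleep, assuming that a change in the page source is a good enough signal that the next page of results has loaded.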

The page presents the job offers in the third ".row" element inside the ".container" block. Within that row, each offer is in a box of the class "result-item".

We extract that row from every saved page, find all the offers in it, and collapse the per-page results into a single list.

In [2]:
# Offers are in the third container row instance
tables = [ x.select(".container .row")[2] for x in pages ]
# Find jobs from tables
jobs = [ x.select(".result-item") for x in tables]

# Collapse jobs into a single list
jobs = [ job for sublist in jobs for job in sublist]
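
The nested comprehension above flattens a list of lists. As a small sketch of an equivalent alternative, `itertools.chain.from_iterable` does the same thing; the strings below are hypothetical stand-ins for the per-page lists of offers.

```python
from itertools import chain

# Hypothetical stand-in for the per-page lists of offers
jobs_by_page = [["offer A", "offer B"], ["offer C"], []]

# Concatenate the sublists in order; empty pages contribute nothing
flat = list(chain.from_iterable(jobs_by_page))
print(flat)
```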

In [3]:
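def san(s):
    """Sanitize a list of strings: drop newlines, tabs and carriage returns, then trim spaces."""
    # Remove line breaks and tabs anywhere in the string
    s = [re.sub("\\n|\\t|\\r","",x) for x in s]
    # Trim leading and trailing spaces
    s = [re.sub("^ +","",x) for x in s]
    s = [re.sub(" +$","",x) for x in s]
    return s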

titles = san([x.select(".text-ellipsis")[0].get_text() for x in jobs])
salaries = san([x.select(".info-salary")[0].get_text() for x in jobs])
cities = san([x.select(".info-city")[0].get_text() for x in jobs])
companies = san([x.select(".info-company-name")[0].get_text() for x in jobs])
dates = san([x.select(".info-publish-date")[0].get_text() for x in jobs])
links = [x.select("div a")[0]['href'] for x in jobs]

dates = [re.sub("^Publicado ","",x) for x in dates]

jobs_tab = pd.DataFrame({
        "date": dates, 
        "firm": companies, 
        "city": cities,
        "title": titles,
        "salary": salaries,
        "link": links
    })
jobs_tab

| | city | date | firm | link | salary | title |
|---|---|---|---|---|---|---|
| 0 | Bogotá y ... | 4 Jun 2018 | STAFFING DE COLOMBIA | /co/ofertas-trabajo/call-center-ventas-gran-fe... | Salario confidencial | Call center ventas gran feria laboral |
| 1 | Neiva | 4 Jun 2018 | BAVARIA S.A. | /co/ofertas-trabajo/profesional-de-despachos/1... | Salario confidencial | Profesional de despachos |
| 2 | Bogotá | 4 Jun 2018 | Empresa confidencial | /co/ofertas-trabajo/administrador-en-servidore... | $4 a $4,5 millones | Administrador en servidores de aplicación |
| 3 | Bogotá | 4 Jun 2018 | Empresa confidencial | /co/ofertas-trabajo/administrador-bases-de-dat... | $4 a $4,5 millones | Administrador bases de datos |
| 4 | Bogotá y ... | 4 Jun 2018 | STAFFING DE COLOMBIA | /co/ofertas-trabajo/toderosmantenimiento-locat... | Menos de $1 millón | Toderos//mantenimiento locativo |
| 5 | Bogotá y ... | 4 Jun 2018 | STAFFING DE COLOMBIA | /co/ofertas-trabajo/tecnicos-en-redes-con-mane... | $1 a $1,5 millones | Tecnicos en redes con manejo de multimetro |
| 6 | Bogotá y ... | 4 Jun 2018 | Empresa confidencial | /co/ofertas-trabajo/disenador/1883405643 | $1,5 a $2 millones | Diseñador |
| 7 | Bogotá y ... | 4 Jun 2018 | Empresa confidencial | /co/ofertas-trabajo/analista-tesoreria/1883231678 | $2 a $2,5 millones | Analista tesorería |
| 8 | Bogotá | 4 Jun 2018 | EFICACIA S.A. | /co/ofertas-trabajo/analista-de-compras/188340... | $1 a $1,5 millones | Analista de compras |
| 9 | Armenia y ... | 4 Jun 2018 | Empresa confidencial | /co/ofertas-trabajo/enfermero-de-auditoria-de-... | $3 a $3,5 millones | Enfermero de auditoria de concurrencia - quindio |
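
The `link` column holds site-relative paths and the `salary` column holds free text, so both need some post-processing before analysis. The sketch below is one possible approach, not part of the original notebook: the helper names `absolute_link` and `parse_salary` are hypothetical, and the salary parser only covers the three patterns visible in the output above (a range, "Menos de ...", and "Salario confidencial"). It assumes amounts use a decimal comma and are expressed in millions.

```python
import re
from urllib.parse import urljoin

# Same URL the scraper starts from
BASE_URL = "http://www.elempleo.com/co/ofertas-empleo/"

# Matches amounts such as "$4" or "$4,5" (decimal comma)
_AMOUNT = re.compile(r"\$(\d+(?:,\d+)?)")

def absolute_link(href):
    """Resolve a site-relative link against the base URL."""
    return urljoin(BASE_URL, href)

def parse_salary(text):
    """Return (low, high) in millions, or None when no usable amount is given."""
    amounts = [float(a.replace(",", ".")) for a in _AMOUNT.findall(text)]
    if len(amounts) == 2:
        # e.g. "$4 a $4,5 millones"
        return (amounts[0], amounts[1])
    if len(amounts) == 1 and text.startswith("Menos de"):
        # e.g. "Menos de $1 millón": treat as a range starting at zero
        return (0.0, amounts[0])
    # e.g. "Salario confidencial" or any pattern not handled here
    return None
```

With the DataFrame above, these could be applied as `jobs_tab["link"].map(absolute_link)` and `jobs_tab["salary"].map(parse_salary)`. Salary texts outside the three handled patterns yield `None` rather than a guess.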
