## Job Text Analytics

The main aim of this project is to scrape job postings for project controls personnel from websites such as Indeed and to analyse the key skills the market demands. Text analytics is then used to extract the top competencies.

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
import csv


### Helper functions

These functions extract and clean text from web pages.
The extracted fields are job title, URL, advertising company, location, summary, and full job description.
Each field is located by its HTML tag together with attributes such as its class.

In [2]:
def extract_text(item):
    if item:
        return item.text.strip()
    else:
        return ''
    
def get_title_from_result(result):
    return result.find('a',{'data-tn-element': 'jobTitle'}).text.strip()

def get_full_jd(result):
    if result is None: 
        return None
    return extract_text(result.find('div',{'class': 'jobsearch-JobComponent-description'}))

def get_url(result):
    a = result.find('a',{'data-tn-element': 'jobTitle'})
    return a['href']
    
def get_company_from_result(result):
    return extract_text(result.find('span',{'class': 'company'}))

def get_location_from_result(result):
    return extract_text(result.find('span',{'class': 'location'}))

def get_summary_from_result(result):
    return extract_text(result.find('span',{'class': 'summary'}))
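
As a minimal sketch of how tag-and-attribute lookups like these behave, the snippet below runs the same kind of `find` calls against a small inline HTML fragment. The markup, company name, and link are invented for illustration and are not taken from any real page; the built-in `html.parser` is used here instead of `lxml` only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; real pages will differ.
SAMPLE_HTML = """
<div class="result">
  <a data-tn-element="jobTitle" href="/rc/clk?jk=abc123"> Project Controls Engineer </a>
  <span class="company"> Example Ltd </span>
  <span class="location">London</span>
</div>
"""

def extract_text(item):
    """Return the stripped text of a tag, or '' if the tag was not found."""
    return item.text.strip() if item else ''

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = soup.find('div', {'class': 'result'})

# Look up each field by tag name plus an identifying attribute
title_tag = result.find('a', {'data-tn-element': 'jobTitle'})
title = extract_text(title_tag)
href = title_tag['href']
company = extract_text(result.find('span', {'class': 'company'}))
location = extract_text(result.find('span', {'class': 'location'}))

# The sample has no <span class="summary">, so find() returns None
# and extract_text falls back to an empty string.
summary = extract_text(result.find('span', {'class': 'summary'}))

print(title, company, location, repr(summary), href)
```

Routing every lookup through `extract_text` means a missing element produces an empty string rather than an `AttributeError`, which is useful when a listing omits a field.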

## Scraping

This section performs the web scraping from Indeed. Two main parameters are used during the scraping:
1. Salary range, which varies between countries
2. Page number (the `start` variable)

The code loops through both parameters, builds a search link for each combination, and extracts every job attribute except the detailed job description. Each job is stored in a dictionary, and all jobs are collected in a list.

The data are dumped to a JSON file for persistence.
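
The loop over salary range and page number can be sketched without touching a live site. The endpoint `jobs.example.com`, the parameter names `salary` and `start`, and the page size of 10 are all assumptions made for illustration; they are not the query format of any particular job board.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://jobs.example.com/search"
SALARY_BINS = ['35-40', '40-45', '45-50']
PAGES = range(3)
RESULTS_PER_PAGE = 10  # assumed page size

def build_search_url(salary_bin, page):
    """Build one search URL for a salary bin and a zero-based page number."""
    params = {
        'q': '"project control"',
        'salary': salary_bin,
        'start': page * RESULTS_PER_PAGE,
    }
    # urlencode percent-escapes the quotes and turns spaces into '+'
    return BASE_URL + '?' + urlencode(params)

# One URL per (salary bin, page) combination
search_urls = [build_search_url(s, p) for s, p in product(SALARY_BINS, PAGES)]

print(len(search_urls))
print(search_urls[0])
```

Building the query with `urlencode` avoids hand-escaping characters such as quotes, and keeping one named placeholder per parameter makes it harder to pass a value that the template silently drops.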


In [None]:
url="https://www.indeed.com/jobs?q=%22project+control%22+${}%2C000&l=USA&radius=100&jt=fulltime&limit=500"
UKurl ="https://www.bayt.com/en/international/jobs/q/project-controls/?page={}"
max_result_per_city=6000

rows=[]
# for salary in set(['55-70','70-85','85-100','100-115']):
for salary in set(['35-40','40-45','45-50','50-55', '55-60']):
    for start in range(10):
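        # NOTE: UKurl has a single {} placeholder, so format() fills it with
        # `salary` and silently ignores `start`; `url` above is never requested.
        r=requests.get(UKurl.format(salary, start))
        soup=BeautifulSoup(r.content,"lxml")
        results=soup.find_all('div',{'class':  'result'})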
        for result in results:
            if result:
                row={}
                row['title']=get_title_from_result(result)
                row['company']=get_company_from_result(result)
                row['city']=get_location_from_result(result)
                row['summary']=get_summary_from_result(result)
                row['bin']=salary
                row['url'] = get_url(result)
                
                rows.append(row)

with open('UKdata.json', 'w') as outfile:
    json.dump(rows, outfile)

A sample of the data gathered is shown below.

In [None]:
rows[3]

## Extracting Detailed Job Description

Using the URLs collected earlier and stored in the JSON file, the detailed job descriptions are collected, appended to the existing data, and stored.

In [None]:
# Reload the scraped rows so this cell also works after a kernel restart
with open('UKdata.json') as f:
    rows = json.load(f)

# Rebuild each job's detail-page URL from the query string of its stored link
# US equivalent: "https://www.indeed.com/viewjob?"
ukurl = "https://www.indeed.co.uk/viewjob?"
urls = [ukurl + row['url'].split('?')[1] for row in rows]

# Fetch every detail page and attach the full job description to its row
for row, url in zip(rows, urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    result = soup.find('div', {'class': 'jobsearch-ViewJobLayout-jobDisplay'})
    row['JD'] = get_full_jd(result)
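
The URL rewriting step above takes the query string of each stored link and attaches it to the job-view address. A slightly more defensive sketch of that step, using the standard library's `urlsplit` rather than `str.split`, might look as follows. The helper name `to_viewjob_url` and the sample links are hypothetical.

```python
from urllib.parse import urlsplit

VIEWJOB_BASE = "https://www.indeed.co.uk/viewjob"

def to_viewjob_url(relative_url):
    """Return the job-view URL for a stored link, or None if it has no query string."""
    query = urlsplit(relative_url).query
    if not query:
        return None
    return VIEWJOB_BASE + '?' + query

# Hypothetical stored links, invented for illustration
print(to_viewjob_url('/rc/clk?jk=abc123&fccid=xyz'))
print(to_viewjob_url('/company/example/jobs/some-role'))
```

Returning `None` for links without a query string lets the caller skip those rows instead of failing with an `IndexError` partway through the run.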

### Store data into a json file

The data are stored in a JSON file, and a sample of the collected data is shown.

In [None]:
import sys
print(sys.stdout.encoding)
# Write back to the same file that was read earlier (the name is case-sensitive on some systems)
with open('UKdata.json', 'w') as outfile:
    json.dump(rows, outfile)
    
rows[1]
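
To illustrate the persistence step in isolation, the sketch below writes a small list of job dictionaries to a JSON file and reads it back. The sample rows and the file name `jobs_sample.json` are made up for the example, and the file is written to a temporary directory so nothing in the working folder is overwritten.

```python
import json
import os
import tempfile

# Made-up rows in the same shape as the scraped data
rows = [
    {'title': 'Project Controls Engineer', 'company': 'Example Ltd', 'bin': '35-40'},
    {'title': 'Planner', 'company': 'Sample plc', 'bin': '40-45', 'JD': None},
]

path = os.path.join(tempfile.mkdtemp(), 'jobs_sample.json')

# An explicit encoding avoids depending on the platform default
with open(path, 'w', encoding='utf-8') as outfile:
    json.dump(rows, outfile, ensure_ascii=False, indent=2)

with open(path, encoding='utf-8') as infile:
    loaded = json.load(infile)

print(loaded == rows)
```

A missing job description stored as `None` is written as JSON `null` and comes back as `None`, so rows whose detail page could not be parsed survive the round trip unchanged.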