# Business Case Overview

1. Determine the industry factors that are most important in predicting the salary amounts for these data.
2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you focus on data-related job postings, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by limiting your search to a single region.



Goal: Scrape your own data from a job aggregation tool like www.seek.com.au to collect the data needed to answer these two questions.

### Directions
In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

# 2. Use urllib and BeautifulSoup to read the contents of the HTML.


I chose to scrape the data from seek.com.au, an Australia-based generalist careers site. I used BeautifulSoup to retrieve the URLs of the individual job postings using the code below. The result was then cleaned into a list of the relevant URLs, which could be accessed individually to obtain the job data.
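
Collecting every posting means requesting one search URL per results page. The following is a minimal sketch of how those page URLs could be built with `urllib.parse`; the helper name `page_url` is hypothetical, and it assumes the site accepts a `page` query parameter as the scraping code in this notebook implies.

```python
from urllib.parse import urlencode

# Base search URL taken from the scraping cell in this notebook.
BASE_URL = "https://www.seek.com.au/data-jobs/in-All-Sydney-NSW"


def page_url(page):
    """Return the search URL for one results page (hypothetical helper)."""
    # urlencode builds the query string, so the page number is always interpolated.
    return f"{BASE_URL}?{urlencode({'page': page})}"


# Build the first three page URLs as a quick check.
page_urls = [page_url(i) for i in range(1, 4)]
for url in page_urls:
    print(url)
```

Building the query string from the loop variable avoids accidentally requesting the same page on every iteration.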

In [17]:
import urllib.request, urllib.parse, urllib.error
import requests                   # Crawl and fetch
from bs4 import BeautifulSoup     # Parse and extract
import pandas as pd               # Load and clean
import lxml                       # Parser backend used by BeautifulSoup

from tqdm import tqdm

data = []
job_links = []
for i in tqdm(range(1, 120)):
    # Interpolate the page number; the literal string 'page=+i' would request the same page every time.
    url_seek = f'https://www.seek.com.au/data-jobs/in-All-Sydney-NSW?page={i}'
    res_seek = requests.get(url_seek)
    soup = BeautifulSoup(res_seek.content, 'lxml')

    # Each job posting on the results page is wrapped in an <article> element.
    titles = soup.find_all('article')

    for title in titles:
        # A <div> has no href; the link is assumed to sit on the title's <a> tag.
        link = title.find('h1').find('a')['href']
        job_links.append(link)

        row = {}
        row['job_title'] = title.find('h1').find('a').text
        # find_all returns a list of tags; it is empty when no salary is advertised.
        row['job_Salary'] = title.find_all('span', {'class': "lwHBT6d"})
        # Position-based lookup; fragile if the page layout changes.
        row['job_Location'] = title.find_all('span')[14].text
        row['job_Description'] = title.find_all('span', {'class': "bl7UwXp"})[0].text
        data.append(row)

df_seek = pd.DataFrame(job_links)
df_data = pd.DataFrame(data)
df_seek

100%|██████████| 119/119 [00:59<00:00,  2.01it/s]


| | 0 |
|---|---|
| 0 | /job/41347509?type=promoted#searchRequestToken... |
| 1 | /job/41353676?type=promoted#searchRequestToken... |
| 2 | /job/41358893?type=standout#searchRequestToken... |
| 3 | /job/41350834?type=standout#searchRequestToken... |
| 4 | /job/41351226?type=standard#searchRequestToken... |
| ... | ... |
| 2611 | /job/41343505?type=standard#searchRequestToken... |
| 2612 | /job/41357910?type=standout#searchRequestToken... |
| 2613 | /job/41348762?type=standout#searchRequestToken... |
| 2614 | /job/41315475?type=standout#searchRequestToken... |
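
The scraped links are relative paths that carry a query string and a tracking fragment, so they need cleaning before each posting can be requested. The following is a rough sketch of one way to do that with `urllib.parse`. The helper `clean_link` is hypothetical, the sample links are made up (the token values are placeholders), and it assumes a posting loads without its `type` parameter.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

SITE_ROOT = "https://www.seek.com.au"


def clean_link(href):
    """Turn a relative job link into an absolute URL without query or fragment."""
    absolute = urljoin(SITE_ROOT, href)
    parts = urlsplit(absolute)
    # Keep scheme, host and path; drop '?type=...' and '#searchRequestToken...'.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical sample links; the token values are placeholders.
raw_links = [
    "/job/41347509?type=promoted#searchRequestToken=PLACEHOLDER",
    "/job/41353676?type=promoted#searchRequestToken=PLACEHOLDER",
    "/job/41347509?type=standout#searchRequestToken=PLACEHOLDER",
]

# dict.fromkeys removes duplicates while preserving the original order.
clean_links = list(dict.fromkeys(clean_link(href) for href in raw_links))
print(clean_links)
```

Deduplicating after cleaning matters because the same posting can appear more than once in the search results under different listing types.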


In [18]:
#<span class="_3FrNV7v _3PZrylH E6m4BZb"><span class="lwHBT6d">Full Time</span></span>
#title.find_all('span', {'class':"_3FrNV7v _3PZrylH E6m4BZb"})[2].text




In [19]:
#<span class="lwHBT6d">$80k - $110k p.a.</span>

#title.find_all_next('span')[15].text

In [20]:
#title.find('span class="_3FrNV7v _3PZrylH E6m4BZb"><span class=""><span class="Eadjc1o">at </span><a href="/Hearing-Australia-jobs" rel="nofollow" class="_3AMdmRg" title="Jobs at Hearing Australia" aria-label="Jobs at Hearing Australia" data-automation="jobCompany" target="_self">Hearing Australia</a></span></span>')
#<span data-automation="jobShortDescription" class="_3FrNV7v xxz8a1h _2E1gs92 _3PZrylH E6m4BZb"><span class="bl7UwXp">Hearing Australia has an international reputation in the delivery of world’s best practice hearing rehabilitation services. Work with the best!</span></span>
#title.find_all('span')[1].text


In [21]:
#<span data-automation="jobShortDescription" class="_3FrNV7v xxz8a1h _2E1gs92 _3PZrylH E6m4BZb"><span class="bl7UwXp">Hearing Australia has an international reputation in the delivery of world’s best practice hearing rehabilitation services. Work with the best!</span></span>

#title.find_all('span',{'class': "bl7UwXp"})[0].text

In [22]:
df_seek.shape

(2616, 1)

In [23]:
df_data


| | job_title | job_Salary | job_Location | job_Description |
|---|---|---|---|---|
| 0 | Senior Data Engineer - Australia's #1 Home Loa... | [] | location: Sydney | Lendi is looking for a senior data engineer wi... |
| 1 | Associate Partner - Cloud Strategy | [] | location: Sydney | The convergence of social, mobile, cloud, big ... |
| 2 | Data Analyst | [[$100k - $110k p.a. + Super + Benefits]] | location: Sydney | This role is ideal for someone who enjoys auto... |
| 3 | Data Analyst | [[$90k - $105k p.a. + Super + Benefits]] | location: Sydney | Work with a market leader in the technology se... |
| 4 | Data Analyst | [] | location: Sydney | 3 month initial contract with view to extend. ... |
| ... | ... | ... | ... | ... |
| 2611 | Admin/ Data Entry | [] | location: Sydney | Looking for someone who is willing to learn a ... |
| 2612 | Business Intelligence Analyst | [] | location: Sydney | A genuinely exciting and versatile role workin... |
| 2613 | Associate Data Collection Specialist | [] | location: Sydney | An Associate Data Collection Specialist is res... |
| 2614 | Reporting Analyst | [] | location: Sydney | Due to securing new business supporting Austra... |


In [24]:
import numpy as np

def get_sal(x):
    # Return the first salary tag, or NaN when no salary was advertised.
    try:
        return x[0]
    except (IndexError, TypeError):
        return np.nan

df_data['job_Salary'].apply(get_sal)



0                                           NaN
1                                           NaN
2       [$100k - $110k p.a. + Super + Benefits]
3        [$90k - $105k p.a. + Super + Benefits]
4                                           NaN
                         ...                   
2611                                        NaN
2612                                        NaN
2613                                        NaN
2614                                        NaN
2615                                        NaN
Name: job_Salary, Length: 2616, dtype: object
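
To use salary as a prediction target, the advertised text needs to become numbers. The following is a minimal sketch using only the standard `re` module; the function name `parse_salary_range` and the pattern are assumptions, not the approach used elsewhere in this notebook, and the pattern treats every number in the string as a salary amount.

```python
import re

# Matches amounts such as "$100k", "$90,000" or "105k" (the pattern is an assumption).
AMOUNT = re.compile(r"\$?(\d[\d,]*(?:\.\d+)?)\s*(k)?", re.IGNORECASE)


def parse_salary_range(text):
    """Return (low, high) in dollars, or None when the text holds no amount."""
    amounts = []
    for number, k_suffix in AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        if k_suffix:
            value *= 1000  # "100k" means 100,000
        amounts.append(value)
    if not amounts:
        return None
    return min(amounts), max(amounts)


print(parse_salary_range("$100k - $110k p.a. + Super + Benefits"))  # (100000.0, 110000.0)
print(parse_salary_range("$90k - $105k p.a. + Super + Benefits"))   # (90000.0, 105000.0)
print(parse_salary_range("Full Time"))                              # None
```

Returning `None` for text without an amount keeps missing salaries distinguishable from parsed ones, which mirrors the `NaN` values in the series above.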

In [25]:
df_data.head()

| | job_title | job_Salary | job_Location | job_Description |
|---|---|---|---|---|
| 0 | Senior Data Engineer - Australia's #1 Home Loa... | [] | location: Sydney | Lendi is looking for a senior data engineer wi... |
| 1 | Associate Partner - Cloud Strategy | [] | location: Sydney | The convergence of social, mobile, cloud, big ... |
| 2 | Data Analyst | [[$100k - $110k p.a. + Super + Benefits]] | location: Sydney | This role is ideal for someone who enjoys auto... |
| 3 | Data Analyst | [[$90k - $105k p.a. + Super + Benefits]] | location: Sydney | Work with a market leader in the technology se... |
| 4 | Data Analyst | [] | location: Sydney | 3 month initial contract with view to extend. ... |


In [14]:
# df_data has no 'Reward' column. Assumption: the salary text lives in 'job_Salary',
# where each cell is a list of <span> tags, so flatten it to plain text first.
reward = df_data['job_Salary'].apply(
    lambda tags: tags[0].get_text() if len(tags) > 0 else None
)

# Pattern 1: amounts with a thousands separator, e.g. "$100,000".
high_sal = reward.str.extractall(pat=r"\$?(\d+,\d+)").unstack()
# Pattern 2: amounts given in thousands, e.g. "100k".
salary_pa = reward.str.extractall(pat=r"(\d+)k").unstack()
# Pattern 3: dollar ranges without a "k" suffix, e.g. "$45 - 50".
salary_ks = reward.str.extractall(pat=r"\$(\d+).\-.(\d+)").unstack()

# Rows matched by more than one pattern.
salary_ks.index.intersection(high_sal.index)
salary_ks.index.intersection(salary_pa.index)

# Rows that none of the three patterns matched and that still need handling.
matched = high_sal.index.append(salary_pa.index).append(salary_ks.index).unique()
reward.drop(index=matched)