### Scrape and Pre-processing 

- Select a website to scrape
- Scrape and prepare your own data.
- Select and parse data from at least 1000 postings for jobs, potentially from multiple location searches

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:

- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. Communication of your process is key. Note that most listings DO NOT come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

Create and compare at least two models for each section. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).

- Section 1: Job Salary Trends
- Section 2: Job Category Factors

job related:
  - Machine learning 
  - Data analyst
  - Data engineer
  - Data Science
  - Business intelligence analyst
  - Data consultant
  - Marketing Data analyst
  - Insights analyst
  - Customer 

### Section 1: Job Salary Trend

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

In [2]:
URL = 'https://www.indeed.com.sg/jobs?q=Data+Scientist&l=Singapore&start=10'
#conducting a request of the stated URL above:
page = requests.get(URL)

#specifying a desired format of “page” using the html parser 
#- this allows python to read the various components of the page, rather than treating it as one long string.
soup = BeautifulSoup(page.text, 'html.parser')


#printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="//d3fw5vlhllyvee.cloudfront.net/s/a934afe/en_SG.js" type="text/javascript">
  </script>
  <link href="//d3fw5vlhllyvee.cloudfront.net/s/ecdfb5e/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://www.indeed.com.sg/rss?q=Data+Scientist&amp;l=Singapore" rel="alternate" title="Data Scientist Jobs, careers in Singapore" type="application/rss+xml"/>
  <link href="/m/jobs?q=Data+Scientist&amp;l=Singapore" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=Data+Scientist&amp;l=Singapore" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
        window['closureReadyCallbacks'] = [];
    }

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureRea

In [3]:
def extract_job_title_from_result(soup): 
    jobs = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
            jobs.append(a['title'])
    return(jobs)

extract_job_title_from_result(soup)

['3 x Senior Data Scientist Hires',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'SVP, Data Scientist, Data Management, Technology and Operations',
 'Data Analyst',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist (Machine Learning)',
 'Data Scientist (Asia Pacific)',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Assistant Manager, Data Scientist',
 'Data Scientist',
 'Customer Support Engineer - APAC']

In [4]:
extract_job_title_from_result(soup)

['3 x Senior Data Scientist Hires',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'SVP, Data Scientist, Data Management, Technology and Operations',
 'Data Analyst',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist (Machine Learning)',
 'Data Scientist (Asia Pacific)',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Assistant Manager, Data Scientist',
 'Data Scientist',
 'Customer Support Engineer - APAC']

In [5]:
def extract_company_from_result(soup): 
    companies = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        company = div.find_all(name='span', attrs={'class':'company'})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)
 
extract_company_from_result(soup)

['Argyll Scott Singapore Pte Ltd, EA Licence No: 11C3721',
 'Accenture',
 'iKas International, EA Licence No: 16S8086',
 'Singtel',
 'DBS Bank',
 'Carousell',
 'Centre for Strategic Infocomm Technologies (CSIT)',
 'MSD',
 'Grab Taxi',
 'ThreatMetrix',
 'Shopee',
 'Slicebread',
 'Mediacorp Pte Ltd',
 'The Bank of Tokyo-Mitsubishi UFJ Ltd',
 'Shell Infotech Pte Ltd',
 'Alteryx, Inc.']

In [6]:
def extract_location_from_result(soup): 
    locations = []
    spans = soup.findAll('span', attrs={'class': 'location'})
    for span in spans:
        locations.append(span.text)
    return(locations)
extract_location_from_result(soup)

['Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore',
 'Singapore']

In [7]:
def extract_salary_from_result(soup): 
    salaries = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        try:
            salaries.append(div.find('nobr').text)
        except:
            try:
                div_two = div.find(name='div', attrs={'class':'sjcl'})
                div_three = div_two.find('div')
                salaries.append(div_three.text.strip())
            except:
                salaries.append('Nothing_found')
    return(salaries)


extract_salary_from_result(soup)

['Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found']

In [8]:
def extract_summary_from_result(soup): 
    summaries = []
    spans = soup.findAll('span', attrs={'class': 'summary'})    
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)



extract_summary_from_result(soup)

['Analyse unstructured data and clean the data sets using machine learning algorithms. Familiar with applying data science within the financial industry....',
 'As well as be exposed to the challenges of using statistics in a business setting, such as incomplete data, biased data, large data sets, low signal-to-noise...',
 '"Personal data collected will be used for recruitment purposes only" Looking for a Principal Data Scientist for a global insurance firm....',
 'Experience working with very large data sets, including statistical analyses, data visualization, data mining, and data cleansing/transformation and machine...',
 'Experience of architecture design / deployment of big data and analytic infrastructure at Enterprise level. We are looking for someone from tier 1 player in the...',
 'Be able to identify gaps and issues in our data and work collaboratively with the broader Data team to propose solutions which can better increase our data...',
 'Data mining and knowledge discovery

In [9]:
max_results_per_city = 100
city_set = ['Singapore']
#asean_set = ['Kuala+Lumpur','Bangkok','Jakarta', 'Manila', 'Ho+Chi+Minh+City', 'Yangon', 'Hanoi', 'Brunei']
#city_set = ['Singapore','New+York','Chicago','San+Francisco', 'Austin', 'Seattle', 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Washington+DC', 'Boulder']

columns = ['city', 'job_title', 'company_name', 'location', 'summary', 'salary']
sample_df = pd.DataFrame(columns = columns)
sample_df

Unnamed: 0,city,job_title,company_name,location,summary,salary


In [10]:
#scraping code:
for city in city_set:
    for start in range(0, max_results_per_city):
        page = requests.get('http://www.indeed.com/jobs?q=data+scientist&l=' + str(city) + '&start=' + str(start))

    for div in soup.find_all(name='div', attrs={'class':'row'}): 
        #specifying row num for index of job posting in dataframe
        num = (len(sample_df) + 1) 
        #creating an empty list to hold the data for each posting
        job_post = [] 
        #append city name
        job_post.append(city) 
        #grabbing job title
        for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
            job_post.append(a['title']) 
        #grabbing company name
        company = div.find_all(name='span', attrs={'class':'company'}) 
        if len(company) > 0: 
            for b in company:
                job_post.append(b.text.strip()) 
        else: 
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                job_post.append(span.text) 
        #grabbing location name
        c = div.findAll('span', attrs={'class': 'location'}) 
        for span in c: 
            job_post.append(span.text) 
        #grabbing summary text
        d = div.findAll('span', attrs={'class': 'summary'}) 
        for span in d:
            job_post.append(span.text.strip()) 
        #grabbing salary
        try:
            job_post.append(div.find('nobr').text) 
        except:
            try:
                div_two = div.find(name='div', attrs={'class':'sjcl'}) 
                div_three = div_two.find('div') 
                job_post.append(div_three.text.strip())
            except:
                job_post.append('Nothing_found') 
        #appending list of job post info to dataframe at index num
        sample_df.loc[num] = job_post

#saving sample_df as a local csv file — define your own local path to save contents 
sample_df.to_csv('[filepath].csv', encoding='utf-8')


In [11]:
indeed = pd.read_csv('./[filepath].csv')
indeed


Unnamed: 0.1,Unnamed: 0,city,job_title,company_name,location,summary,salary
0,1,Singapore,3 x Senior Data Scientist Hires,"Argyll Scott Singapore Pte Ltd, EA Licence No:...",Singapore,Analyse unstructured data and clean the data s...,Nothing_found
1,2,Singapore,Data Scientist,Accenture,Singapore,As well as be exposed to the challenges of usi...,Nothing_found
2,3,Singapore,Data Scientist,"iKas International, EA Licence No: 16S8086",Singapore,"""Personal data collected will be used for recr...",Nothing_found
3,4,Singapore,Data Scientist,Singtel,Singapore,"Experience working with very large data sets, ...",Nothing_found
4,5,Singapore,"SVP, Data Scientist, Data Management, Technolo...",DBS Bank,Singapore,Experience of architecture design / deployment...,Nothing_found
5,6,Singapore,Data Analyst,Carousell,Singapore,Be able to identify gaps and issues in our dat...,Nothing_found
6,7,Singapore,Data Scientist,Centre for Strategic Infocomm Technologies (CSIT),Singapore,Data mining and knowledge discovery. Work with...,Nothing_found
7,8,Singapore,Data Scientist,MSD,Singapore,The Data Scientist will:. Data Science colleag...,Nothing_found
8,9,Singapore,Data Scientist (Machine Learning),Grab Taxi,Singapore,Develop creative algorithms by employing machi...,Nothing_found
9,10,Singapore,Data Scientist (Asia Pacific),ThreatMetrix,Singapore,"As a Data Scientist consultant, you’ll use dat...",Nothing_found


In [12]:
indeed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
Unnamed: 0      16 non-null int64
city            16 non-null object
job_title       16 non-null object
company_name    16 non-null object
location        16 non-null object
summary         16 non-null object
salary          16 non-null object
dtypes: int64(1), object(6)
memory usage: 976.0+ bytes


In [13]:
indeed.salary.value_counts()

Nothing_found    16
Name: salary, dtype: int64

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:

- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

Discover which features have the greatest importance when determining a low vs. high paying job.
- Your Boss is interested in what overall features hold the greatest significance.
- HR is interested in which SKILLS and KEY WORDS hold the greatest significance.