# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.

---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. A brief writeup in an executive summary, written for a non-technical audience.
   - Writeups should be at least 500-1000 words, defining any technical terms, explaining your approach, as well as any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your Boss is interested in what overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

### Project Feedback + Evaluation

For all projects, students will be evaluated on a simple 3 point scale (0, 1, or 2). Instructors will use this rubric when scoring student performance on each of the core project **requirements:** 

Score | Expectations
----- | ------------
**0** | _Does not meet expectations. Try again._
**1** | _Meets expectations. Good job._
**2** | _Surpasses expectations. Brilliant!_

[For more information on how we grade our DSI projects, see our project grading walkthrough.](https://git.generalassemb.ly/dsi-projects/readme/blob/master/README.md)


## Loading the Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
jobs = pd.read_csv('./project-4-starter.csv')

In [3]:
jobs.head()

Unnamed: 0.1,Unnamed: 0,JobId,JobTitle,Company,CompanyAddress,CountryCode,JobType,JobIndustry,JobLevel,SalaryFrom,SalaryTo,Roles,Requirements,JobSite,JobTags,URL
0,0,1284,Data Scientist - Business Analytics,INFINEON TECHNOLOGIES ASIA PACIFIC PTE LTD,"INFINEON, 8 KALLANG SECTOR 349282",SG,Permanent,Information Technology,"Professional, Executive, Senior Executive","$6,000","to$10,000",Roles & Responsibilities\nIn your new role you...,Requirements\nYou are best equipped for this t...,https://www.mycareersfuture.sg/,data scientist,https://www.mycareersfuture.sg/job/data-scient...
1,1,1285,Data Scientist,ALPHATECH BUSINESS SOLUTIONS PTE. LTD.,,SG,"Permanent, Full Time",Information Technology,Professional,"$5,500","to$8,000",Roles & Responsibilities\n· Use machine ...,Requirements\n· Atleast 3+ years of Deep...,https://www.mycareersfuture.sg/,data scientist,https://www.mycareersfuture.sg/job/data-scient...
2,2,1286,Data Scientist,ITPM CONSULTING PTE. LTD.,"BANK OF SINGAPORE CENTRE, 63 MARKET STREET 048942",SG,"Permanent, Full Time",Information Technology,Professional,"$7,000","to$8,500",Roles & Responsibilities\nITPM Consulting Pte ...,Requirements\nResearch and develop statistical...,https://www.mycareersfuture.sg/,data scientist,https://www.mycareersfuture.sg/job/data-scient...
3,3,1289,Data Scientist,GO-JEK SINGAPORE PTE. LTD.,"AXA TOWER, 8 SHENTON WAY 068811",SG,Permanent,Information Technology,Junior Executive,"$6,000","to$8,000",Roles & Responsibilities\nGOJEK is the largest...,Requirements\nWhat we are looking for:\nYou gr...,https://www.mycareersfuture.sg/,data scientist,https://www.mycareersfuture.sg/job/data-scient...
4,4,1290,Data Scientist,GO-JEK SINGAPORE PTE. LTD.,"AXA TOWER, 8 SHENTON WAY 068811",SG,Permanent,Information Technology,Junior Executive,"$6,000","to$8,000",Roles & Responsibilities\nGOJEK is the largest...,Requirements\nWhat we are looking for:\nYou gr...,https://www.mycareersfuture.sg/,data scientist,https://www.mycareersfuture.sg/job/data-scient...


In [4]:
jobs.shape

(930, 16)

In [5]:
jobs.isnull().sum()

Unnamed: 0          0
JobId               0
JobTitle            0
Company             0
CompanyAddress    192
CountryCode         0
JobType             0
JobIndustry         0
JobLevel            5
SalaryFrom          0
SalaryTo            0
Roles               0
Requirements       30
JobSite             0
JobTags             0
URL                 0
dtype: int64

In [6]:
jobs.describe()

Unnamed: 0.1,Unnamed: 0,JobId
count,930.0,930.0
mean,464.5,1789.862366
std,268.612174,292.799042
min,0.0,1284.0
25%,232.25,1539.25
50%,464.5,1790.5
75%,696.75,2035.75
max,929.0,2302.0


## EDA

### Dropping columns and rows

In [7]:
# Unnamed column seems to be the running id, ok to drop
# JobId will not be useful in analysis
# Job location is unlikely to influence salaries as SG is fairly small and does not have urban/rural divides, hence is unlikely to be significant in determining salary
# All jobs are located in SG so they are not useful in analysis
# Using lowest salary estimation is sufficient as company may set higher salary boundary to be unrealistically high to attract interest
# All observations taken from the same website, so can drop JobSite column
# URL is not relevant, can drop
jobs.drop(columns=['Unnamed: 0','JobId','CompanyAddress','CountryCode','SalaryTo','JobSite','URL'],inplace=True)

In [8]:
# as job requirements will be used to predict job title, must drop null values
jobs.dropna(inplace=True)

In [9]:
jobs.shape

(900, 9)

In [10]:
# Checking for duplicates
duplicates = jobs[jobs.duplicated()].index.values

#drop duplicate rows
#reset the index to facilitate indexing for subsequent steps
jobs.drop(index=duplicates,inplace=True)
jobs.reset_index(drop=True,inplace=True)

In [11]:
jobs.shape

(867, 9)

### Changing data types and removing unwanted text

In [12]:
# remove notation in salary details
jobs['SalaryFrom'] = jobs['SalaryFrom'].apply(lambda x:x.replace('$','')).apply(lambda x:x.replace(',',''))
jobs['SalaryFrom'] = pd.to_numeric(jobs['SalaryFrom'])

In [13]:
# clean up roles and requirements column
jobs['Roles'] = jobs['Roles'].apply(lambda x:x.replace('Roles & Responsibilities','')).apply(lambda x:x.replace('\n',' '))
jobs['Requirements'] = jobs['Requirements'].astype('str').apply(lambda x:x.replace('Requirements\n','')).apply(lambda x:x.replace('\n',' '))

### Checking for consistency of data

In [14]:
jobs['JobType'].value_counts()

Full Time                         349
Permanent, Full Time              181
Permanent                         139
Contract, Full Time               116
Contract                           69
Internship                          4
Full Time, Internship               3
Permanent, Contract                 2
Temporary, Contract                 1
Permanent, Contract, Full Time      1
Part Time, Full Time                1
Temporary                           1
Name: JobType, dtype: int64

In [15]:
# 'temporary' appears to be a misclassification of 'contract'
jobs['JobType'] = jobs['JobType'].apply(lambda x:x.replace('Temporary','Contract'))

#dropping extra 'contract'
jobs.at[jobs[jobs['JobType']=='Contract, Contract'].index.values,'JobType'] = 'Contract'

# Part time, Full time doesn't make sense. Impute with Part time as that is likely to be on the lower end of the salary range
jobs['JobType'] = jobs['JobType'].apply(lambda x:x.replace('Part Time, Full Time','Part Time'))

## Feature Engineering

### For Job Type

In [16]:
#create new dataframe column for each job type for boolean results of string search from the original job type column
#replace True with 1 and False with 0 to convert into categorical feature for modelling later

jobs['JobType_Permanent'] = jobs['JobType'].str.contains('Permanent').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobType_Contract'] = jobs['JobType'].str.contains('Contract').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobType_FullTime'] = jobs['JobType'].str.contains('Full Time').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobType_PartTime'] = jobs['JobType'].str.contains('Part Time').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobType_Internship'] = jobs['JobType'].str.contains('Internship').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))

#drop the original job type column to prevent creating noise
jobs.drop(columns='JobType',inplace=True)

### For Job Level

In [17]:
jobs['JobLevel'].unique()

array(['Professional, Executive, Senior Executive', 'Professional',
       'Junior Executive', 'Professional, Senior Executive', 'Executive',
       'Middle Management, Manager', 'Manager', 'Senior Executive',
       'Manager, Professional', 'Senior Management, Manager',
       'Middle Management, Manager, Professional', 'Middle Management',
       'Senior Management', 'Fresh/entry level',
       'Manager, Senior Executive', 'Non-executive',
       'Fresh/entry level, Executive',
       'Executive, Junior Executive, Senior Executive',
       'Manager, Professional, Executive',
       'Non-executive, Junior Executive', 'Professional, Executive',
       'Fresh/entry level, Professional, Junior Executive',
       'Manager, Executive, Senior Executive',
       'Fresh/entry level, Professional', 'Executive, Senior Executive',
       'Executive, Junior Executive', 'Professional, Non-executive',
       'Junior Executive, Senior Executive',
       'Senior Management, Professional',
       'Mid

In [18]:
# job levels are absolute so we must select only one per job title
# Since I used lowest salary, I will impute each title with the lowest job title available
# Job levels in ascending order: Non-executive, Fresh/entry level, Junior Executive, Executive, Senior Executive, Manager, Middle Management, Senior Management, Professional
# create empty list to run if statements

JobLevels = []

for x in jobs['JobLevel']:
    if x.find('Non-executive')!=-1:
        JobLevels.append('Non-executive')
    elif x.find('Fresh/entry level')!=-1:
        JobLevels.append('Fresh/entry level')
    elif x.find('Junior Executive')!=-1:
        JobLevels.append('Junior Executive')
    elif x.find('Senior')==-1 & x.find('Executive')!=1:
        JobLevels.append('Executive')
    elif x.find('Senior Executive')!=-1:
        JobLevels.append('Senior Executive')
    elif x.find('Manager')!=-1:
        JobLevels.append('Manager')
    elif x.find('Middle Management')!=-1:
        JobLevels.append('Middle Management')
    elif x.find('Senior Management')!=-1:
        JobLevels.append('Senior Management')
    else:
        JobLevels.append('Professional')
        
# replace job level column by populating with the list
jobs['JobLevels'] = JobLevels

#drop the original JobLevel column
jobs.drop(columns='JobLevel',inplace=True)

# dummy this variable for modelling later
joblevel_dummies = pd.get_dummies(jobs['JobLevels'],drop_first=True)
jobs = pd.concat([jobs,joblevel_dummies],axis=1)

### For Job Industry

In [19]:
# do the same thing as job type for the industries, but only for the top 10 most frequently occuring industries
# including all of them is likely to introduce unnecessary noise
# even though 'others' is in the top 10, it is likely a mix of many other industries and hence more likely to be noise than signal

jobs['JobIndustry_IT'] = jobs['JobIndustry'].str.contains('Information Technology').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobIndustry_Finance'] = jobs['JobIndustry'].str.contains('Banking and Finance').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobIndustry_Sciences'] = jobs['JobIndustry'].str.contains('Sciences / Laboratory / R&D').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobIndustry_Engineering'] = jobs['JobIndustry'].str.contains('Engineering').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobIndustry_Consulting'] = jobs['JobIndustry'].str.contains('Consulting').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))
jobs['JobIndustry_Manufacturing'] = jobs['JobIndustry'].str.contains('Manufacturing').astype('str').apply(lambda x:x.replace('True','1')).apply(lambda x:x.replace('False','0'))

In [20]:
jobs.head()

Unnamed: 0,JobTitle,Company,JobIndustry,SalaryFrom,Roles,Requirements,JobTags,JobType_Permanent,JobType_Contract,JobType_FullTime,...,Non-executive,Professional,Senior Executive,Senior Management,JobIndustry_IT,JobIndustry_Finance,JobIndustry_Sciences,JobIndustry_Engineering,JobIndustry_Consulting,JobIndustry_Manufacturing
0,Data Scientist - Business Analytics,INFINEON TECHNOLOGIES ASIA PACIFIC PTE LTD,Information Technology,6000,In your new role you will: Drive prove of con...,You are best equipped for this task if you hav...,data scientist,1,0,0,...,0,0,1,0,1,0,0,0,0,0
1,Data Scientist,ALPHATECH BUSINESS SOLUTIONS PTE. LTD.,Information Technology,5500,· Use machine learning and analytical t...,· Atleast 3+ years of Deep Learning and ...,data scientist,1,0,1,...,0,0,0,0,1,0,0,0,0,0
2,Data Scientist,ITPM CONSULTING PTE. LTD.,Information Technology,7000,ITPM Consulting Pte Ltd is inviting Data scie...,Research and develop statistical learning mode...,data scientist,1,0,1,...,0,0,0,0,1,0,0,0,0,0
3,Data Scientist,GO-JEK SINGAPORE PTE. LTD.,Information Technology,6000,GOJEK is the largest consumer technology comp...,What we are looking for: You greatly value hum...,data scientist,1,0,0,...,0,0,0,0,1,0,0,0,0,0
4,Data Analyst,NTT DATA SINGAPORE PTE. LTD.,"Banking and Finance, Information Technology",5000,The data analyst will provide the big data an...,"Bachelors/Masters in Computer Science, Statist...",data scientist,0,1,0,...,0,0,1,0,1,1,0,0,0,0


In [21]:
jobs.dtypes

JobTitle                     object
Company                      object
JobIndustry                  object
SalaryFrom                    int64
Roles                        object
Requirements                 object
JobTags                      object
JobType_Permanent            object
JobType_Contract             object
JobType_FullTime             object
JobType_PartTime             object
JobType_Internship           object
JobLevels                    object
Fresh/entry level             uint8
Junior Executive              uint8
Manager                       uint8
Non-executive                 uint8
Professional                  uint8
Senior Executive              uint8
Senior Management             uint8
JobIndustry_IT               object
JobIndustry_Finance          object
JobIndustry_Sciences         object
JobIndustry_Engineering      object
JobIndustry_Consulting       object
JobIndustry_Manufacturing    object
dtype: object

## Question 1

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, LassoCV

In [23]:
jobs.columns

Index(['JobTitle', 'Company', 'JobIndustry', 'SalaryFrom', 'Roles',
       'Requirements', 'JobTags', 'JobType_Permanent', 'JobType_Contract',
       'JobType_FullTime', 'JobType_PartTime', 'JobType_Internship',
       'JobLevels', 'Fresh/entry level', 'Junior Executive', 'Manager',
       'Non-executive', 'Professional', 'Senior Executive',
       'Senior Management', 'JobIndustry_IT', 'JobIndustry_Finance',
       'JobIndustry_Sciences', 'JobIndustry_Engineering',
       'JobIndustry_Consulting', 'JobIndustry_Manufacturing'],
      dtype='object')

In [24]:
# defining target and values dataset as X and y
# I am using JobTags as a substitute of job title
X = jobs.drop(columns=['JobTitle','SalaryFrom','Roles','Requirements'])
y = jobs['SalaryFrom'].values

# doing train-test split of the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

No scaling of the dataset. As there are no numerical features in the training dataset we do not need to scale it (other than the manually dummied categories which should remain as 0s and 1s anyway)

### Model 1: Linear Regression

In [25]:
#building basic linear regression model
lr = LinearRegression()

lr.fit(X_train, y_train)

# score the fit model using the train data
print(lr.score(X_train, y_train))

#compare with test model
print(lr.score(X_test, y_test))

#some dramatic overfitting is happening!

ValueError: could not convert string to float: 'Executive'

In [None]:
optimal_lasso = LassoCV(n_alphas=1000, cv=10, verbose=1, n_jobs= 6)
optimal_lasso.fit(X_train, y_train)

print optimal_lasso.alpha_