# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.

---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. A brief writeup in an executive summary, written for a non-technical audience.
   - Writeups should be at least 500-1000 words, defining any technical terms, explaining your approach, as well as any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your Boss is interested in what overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

### Project Feedback + Evaluation

For all projects, students will be evaluated on a simple 3 point scale (0, 1, or 2). Instructors will use this rubric when scoring student performance on each of the core project **requirements:** 

Score | Expectations
----- | ------------
**0** | _Does not meet expectations. Try again._
**1** | _Meets expectations. Good job._
**2** | _Surpasses expectations. Brilliant!_

[For more information on how we grade our DSI projects, see our project grading walkthrough.](https://git.generalassemb.ly/dsi-projects/readme/blob/master/README.md)


<div class="alert alert-success">

## Qn1: Predict Salary (Factors that affect salary)
- Import the 2 CSV files saved from the EDA and data vectorizer
- One file contains dummy variables for the emp_type ('emp_dummies.csv')
- The other file contains vectorized variables from job_title ('job_vect_title.csv')
- Combine both together
- The dummy variables for emp_type and the vectorized variables for the job_title will be the predictors (X)
  in the model
- While the average_salary will be the y (predicted value)
- split the dataframe into train and test set using train_test_split
- Train on the 2 models


### For Qn 1, the 2 regression model to be used are :
- multiple linear regression(lasso, ridge, elastic net)
- random forest regressor

In [1]:
import pandas as pd
import numpy as np

In [2]:
job_vect_title = pd.read_csv('job_vect_title.csv')
emp_dummies = pd.read_csv('emp_dummies.csv')
job_vect_details = pd.read_csv('job_vect_details.csv')

In [3]:
print(job_vect_title.shape)
print(emp_dummies.shape)
print(job_vect_details.shape)

(3312, 757)
(3312, 50)
(3312, 100)


In [4]:
job_vect_title.head()

Unnamed: 0,aa,aa modif,aa overhaul,aba,academ,account,account admin,account assist,account clerk,account cum,...,web,web develop,west,work,writer,year,year contract,yield,yx,average_salary
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5650.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5750.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5000.0
3,0.0,0.0,0.0,0.0,0.0,0.231502,0.0,0.0,0.0,0.448773,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2900.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4500.0


In [5]:
emp_dummies.head()

Unnamed: 0,company_name,job_title,seniority,job_cat,job_details,low_salary,high_salary,average_salary,salary_range,"Contract, Full Time",...,"Permanent, Full Time, Flexi work","Permanent, Full Time, Internship","Permanent, Temporary, Contract","Permanent, Temporary, Full Time",Temporary,"Temporary, Contract","Temporary, Contract, Full Time","Temporary, Full Time","Temporary, Internship",job_title1
0,IRISNATION SINGAPORE PTE. LTD.,Senior Manager,Manager,Advertising / Media,Roles & ResponsibilitiesWe’re Iris Concise (ww...,4500.0,6800.0,5650.0,medium,0,...,0,0,0,0,0,0,0,0,0,senior manag
1,IRISNATION SINGAPORE PTE. LTD.,Senior Manager,Middle Management,Advertising / Media,Roles & ResponsibilitiesWe’re Iris Concise (ww...,4500.0,7000.0,5750.0,medium,0,...,0,0,0,0,0,0,0,0,0,senior manag
2,IRISNATION SINGAPORE PTE. LTD.,Campaign Manager,Middle Management,Advertising / Media,Roles & ResponsibilitiesWe’re Iris Concise (ww...,4000.0,6000.0,5000.0,medium,0,...,0,0,0,0,0,0,0,0,0,campaign manag
3,THE SUPREME HR ADVISORY PTE. LTD.,Account Cum Admin / / Senior Level / / 2...,Senior Executive,Consulting,Roles & ResponsibilitiesResponsibilities: Exec...,2800.0,3000.0,2900.0,low,0,...,0,0,0,0,0,0,0,0,0,account cum admin senior level payrol cpf clar...
4,THE SUPREME HR ADVISORY PTE. LTD.,Senior Site Engineers / / Electrical / / ...,Senior Executive,Building and Construction,Roles & Responsibilities1) Electrical Minimum...,4000.0,5000.0,4500.0,medium,0,...,0,0,0,0,0,0,0,0,0,senior site engin electr m vac bug is senior l...


In [6]:
# Drop unnecessary columns in emp_dummmies, leaving behind only the dummy variables

emp_dummies2 = emp_dummies.iloc[:,9:]
emp_dummies2 = emp_dummies2.drop(columns='job_title1')

In [7]:
emp_dummies2.head()

Unnamed: 0,"Contract, Full Time","Contract, Full Time, Flexi work","Contract, Full Time, Internship","Contract, Internship",Flexi work,"Freelance, Full Time, Flexi work",Full Time,"Full Time, Flexi work","Full Time, Internship",Internship,...,"Permanent, Full Time","Permanent, Full Time, Flexi work","Permanent, Full Time, Internship","Permanent, Temporary, Contract","Permanent, Temporary, Full Time",Temporary,"Temporary, Contract","Temporary, Contract, Full Time","Temporary, Full Time","Temporary, Internship"
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [8]:
job_vect_details.head()

Unnamed: 0,abil,abl,account,activ,analysi,analyt,applic,assist,build,busi,...,technic,technolog,test,time,tool,understand,use,user,work,year
0,0.046827,0.0,0.0,0.109542,0.053212,0.263381,0.0,0.0,0.055427,0.211422,...,0.0,0.0,0.0,0.0,0.0,0.050433,0.094867,0.0,0.208964,0.035964
1,0.046801,0.0,0.0,0.10948,0.053182,0.263232,0.0,0.0,0.055395,0.211302,...,0.0,0.0,0.0,0.0,0.0,0.050405,0.094813,0.0,0.208845,0.035944
2,0.0,0.0,0.0,0.0,0.0,0.059807,0.0,0.0,0.0,0.144026,...,0.061116,0.236647,0.0,0.0,0.0,0.0,0.0,0.0,0.169466,0.040833
3,0.082302,0.0,0.445311,0.0,0.0,0.0,0.0,0.0,0.0,0.297271,...,0.0,0.0,0.0,0.094528,0.0,0.0,0.0,0.0,0.157401,0.12642
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.375665,0.226292


In [9]:
# Concat emp_dummies2 and job_vect_title
# Named the combine dataframe as generic df for ease of typing

df = pd.concat([emp_dummies2,job_vect_title,job_vect_details],axis=1)

In [10]:
df.shape
# 757 + 40 columns = 806
# check to ensure concatenation is correct

(3312, 897)

In [11]:
df.tail()

#ensure that row index is correct and consistent with the number of rows

Unnamed: 0,"Contract, Full Time","Contract, Full Time, Flexi work","Contract, Full Time, Internship","Contract, Internship",Flexi work,"Freelance, Full Time, Flexi work",Full Time,"Full Time, Flexi work","Full Time, Internship",Internship,...,technic,technolog,test,time,tool,understand,use,user,work,year
3307,0,0,0,0,0,0,1,0,0,0,...,0.027899,0.081021,0.094957,0.0,0.029569,0.0,0.024584,0.0,0.015472,0.0
3308,0,0,0,0,0,0,1,0,0,0,...,0.116603,0.169311,0.132288,0.0,0.185374,0.218494,0.051374,0.0,0.096997,0.0
3309,0,0,0,0,0,0,1,0,0,0,...,0.14948,0.07235,0.084794,0.0,0.237642,0.350124,0.131719,0.0,0.207242,0.099871
3310,0,0,0,0,0,0,1,0,0,0,...,0.0,0.142062,0.083248,0.073315,0.155541,0.068749,0.064659,0.0,0.0,0.0
3311,0,0,0,0,0,0,1,0,0,0,...,0.07576,0.513362,0.085951,0.0,0.0,0.0,0.0,0.0,0.042014,0.101234


In [12]:
# split the dataset into train and test set, ensuring that the rows must be more than the columns

from sklearn.model_selection import train_test_split

df_x = df.drop(columns='average_salary')
X = df_x
y = df['average_salary']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)

print ('X_train:',X_train.shape)
print ('X_test:',X_test.shape)
print ('y_train:',y_train.shape)
print ('y_test:',y_test.shape)


X_train: (1987, 896)
X_test: (1325, 896)
y_train: (1987,)
y_test: (1325,)


In [14]:
# Create another dataframe without job_details for modelling, df1

In [15]:
df1 = pd.concat([emp_dummies2,job_vect_title],axis=1)

In [16]:
df1.shape

(3312, 797)

In [17]:
# split the dataset into train and test set, ensuring that the rows must be more than the columns

from sklearn.model_selection import train_test_split

df1_x = df1.drop(columns='average_salary')
X1 = df1_x
y1 = df1['average_salary']

In [18]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.4, random_state=10)

print ('X1_train:',X1_train.shape)
print ('X1_test:',X1_test.shape)
print ('y1_train:',y1_train.shape)
print ('y1_test:',y1_test.shape)

type(X1_train)

X1_train: (1987, 796)
X1_test: (1325, 796)
y1_train: (1987,)
y1_test: (1325,)


pandas.core.frame.DataFrame

<div class="alert alert-info">

## Modelling for Qn1

In [19]:
# Import necessary libraries

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics



### Linear Regression

In [20]:
print('LINEAR REGRESSION FOR DF (DATASET WITHOUT JOB_DETAILS)')

lr = LinearRegression()
lr_lasso = Lasso(random_state=0, alpha=0.5) # alpha 0 to 1
lr_ridge = Ridge(alpha=0.5)   # alpha 0 to 1
elast = ElasticNet(alpha=0.5, l1_ratio=0.1)   # alpha 0 to 1

lr.fit(X_train, y_train)
lr_lasso.fit(X_train, y_train)
lr_ridge.fit(X_train, y_train)
elast.fit(X_train, y_train)

print ('Accuracy score :')
print ('LinearRegression          : ', lr.score(X_test, y_test))
print ('Lasso                     : ', lr_lasso.score(X_test, y_test))
print ('Ridge                     : ', lr_ridge.score(X_test, y_test))
print ('ElasticNet                : ', elast.score(X_test, y_test))

print('======================================================================')

print ('Cross_val_score :')
print ('LinearRegression    : ', cross_val_score(lr, X_train, y_train, cv=5, n_jobs=-1).mean())
print ('Lasso               : ', cross_val_score(lr_lasso, X_train, y_train, cv=5, n_jobs=-1).mean())
print ('Ridge               : ', cross_val_score(lr_ridge, X_train, y_train, cv=5, n_jobs=-1).mean())
print ('ElasticNet          : ', cross_val_score(elast, X_train, y_train, cv=5, n_jobs=-1).mean())

print('======================================================================')

print ('Cross_val_predict :')
print ('LinearRegression    : ', cross_val_predict(lr, X_test, y_test, cv=5, n_jobs=-1).mean())
print ('Lasso               : ', cross_val_predict(lr_lasso, X_test, y_test, cv=5, n_jobs=-1).mean())
print ('Ridge               : ', cross_val_predict(lr_ridge, X_test, y_test, cv=5, n_jobs=-1).mean())
print ('ElasticNet          : ', cross_val_predict(elast, X_test, y_test, cv=5, n_jobs=-1).mean())

print('======================================================================')

print ('RMSE :')
print('RMSE LinearRegression :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(lr, X_test, y_test, cv=5, n_jobs=-1) )))
print('RMSE Lasso            :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(lr_lasso, X_test, y_test, cv=5, n_jobs=-1) )))
print('RMSE Ridge            :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(lr_ridge, X_test, y_test, cv=5, n_jobs=-1) )))
print('RMSE ElasticNet       :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(elast, X_test, y_test, cv=5, n_jobs=-1) )))

LINEAR REGRESSION FOR DF (DATASET WITHOUT JOB_DETAILS)
Accuracy score :
LinearRegression          :  -1.1742473437286644e+19
Lasso                     :  0.1850014822725098
Ridge                     :  0.3546955003906368
ElasticNet                :  0.02744471494981027
Cross_val_score :
LinearRegression    :  -1.2976101808896777e+25
Lasso               :  -0.3962688454137697
Ridge               :  -0.04379748238779069
ElasticNet          :  0.03034200479288369
Cross_val_predict :
LinearRegression    :  289978889547848.6
Lasso               :  6523.877990915464
Ridge               :  6465.842399363487
ElasticNet          :  6342.915832961627
RMSE :
RMSE LinearRegression : 2.189446959597532e+16
RMSE Lasso            : 7785.933427502546
RMSE Ridge            : 6184.04816041895
RMSE ElasticNet       : 6699.379932451972


In [21]:
print('LINEAR REGRESSION FOR DF1 (DATASET WITH JOB_DETAILS)')

lr = LinearRegression()
lr_lasso = Lasso(random_state=0, alpha=0.5) # alpha 0 to 1
lr_ridge = Ridge(alpha=0.5)   # alpha 0 to 1
elast = ElasticNet(alpha=0.5, l1_ratio=0.1)   # alpha 0 to 1

lr.fit(X1_train, y1_train)
lr_lasso.fit(X1_train, y1_train)
lr_ridge.fit(X1_train, y1_train)
elast.fit(X1_train, y1_train)

print ('Accuracy score :')
print ('LinearRegression          : ', lr.score(X1_test, y1_test))
print ('Lasso                     : ', lr_lasso.score(X1_test, y1_test))
print ('Ridge                     : ', lr_ridge.score(X1_test, y1_test))
print ('ElasticNet                : ', elast.score(X1_test, y1_test))

print('======================================================================')

print ('Cross_val_score :')
print ('LinearRegression    : ', cross_val_score(lr, X1_train, y1_train, cv=5, n_jobs=-1).mean())
print ('Lasso               : ', cross_val_score(lr_lasso, X1_train, y1_train, cv=5, n_jobs=-1).mean())
print ('Ridge               : ', cross_val_score(lr_ridge, X1_train, y1_train, cv=5, n_jobs=-1).mean())
print ('ElasticNet          : ', cross_val_score(elast, X1_train, y1_train, cv=5, n_jobs=-1).mean())

print('======================================================================')

print ('Cross_val_predict :')
print ('LinearRegression    : ', cross_val_predict(lr, X1_test, y1_test, cv=5, n_jobs=-1).mean())
print ('Lasso               : ', cross_val_predict(lr_lasso, X1_test, y1_test, cv=5, n_jobs=-1).mean())
print ('Ridge               : ', cross_val_predict(lr_ridge, X1_test, y1_test, cv=5, n_jobs=-1).mean())
print ('ElasticNet          : ', cross_val_predict(elast, X1_test, y1_test, cv=5, n_jobs=-1).mean())

print('======================================================================')

print ('RMSE :')
print('RMSE LinearRegression :', np.sqrt(metrics.mean_squared_error(y1_test,cross_val_predict(lr, X1_test, y1_test, cv=5, n_jobs=-1) )))
print('RMSE Lasso            :', np.sqrt(metrics.mean_squared_error(y1_test,cross_val_predict(lr_lasso, X1_test, y1_test, cv=5, n_jobs=-1) )))
print('RMSE Ridge            :', np.sqrt(metrics.mean_squared_error(y1_test,cross_val_predict(lr_ridge, X1_test, y1_test, cv=5, n_jobs=-1) )))
print('RMSE ElasticNet       :', np.sqrt(metrics.mean_squared_error(y1_test,cross_val_predict(elast, X1_test, y1_test, cv=5, n_jobs=-1) )))

LINEAR REGRESSION FOR DF1 (DATASET WITH JOB_DETAILS)
Accuracy score :
LinearRegression          :  -3.63178835159095e+22
Lasso                     :  0.15697959533655714
Ridge                     :  0.3354452711243081
ElasticNet                :  0.011351835586872805
Cross_val_score :
LinearRegression    :  -4.50175264420817e+21
Lasso               :  -0.33344230518746587
Ridge               :  -0.002244586804908599
ElasticNet          :  0.009706266057962543
Cross_val_predict :
LinearRegression    :  1642804130087410.8
Lasso               :  6531.586046196968
Ridge               :  6473.411140409444
ElasticNet          :  6342.478073894753
RMSE :
RMSE LinearRegression : 4.1846613072355336e+16
RMSE Lasso            : 7911.634107745522
RMSE Ridge            : 6384.957703993488
RMSE ElasticNet       : 6761.640909135134


In [22]:
# By observing the modelling using both with job_detail and without, 
# it seems that the models perform better when job_details are NOT use for the modelling
# Even though linear regression models are bad, we can observe the difference in the performance

### Random Forest Regressor

#### Modelling of data set without job details

In [23]:
rfr = RandomForestRegressor(n_estimators = 10, random_state = 10)

rfr.fit(X_train, y_train)

print('RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)')

print ('Cross_val_predict  :', cross_val_predict(rfr, X_test, y_test, cv=5, n_jobs=-1).mean())
print ('Accuracy Score     : ', rfr.score(X_test, y_test))
print('RMSE                :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(rfr, X_test, y_test, cv=5, n_jobs=-1) )))

RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)
Cross_val_predict  : 6587.07881349506
Accuracy Score     :  0.3670918718213345
RMSE                : 6404.898667679883


In [24]:
# try to loop through a range of trees or n values from 5 to 30 and look at the score trend
print('RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)')

for n in range(5,30):
    rfr = RandomForestRegressor(n_estimators = n, random_state = 10,n_jobs=-1)
    rfr.fit(X_train,y_train)
    print('n={}:'.format(n),rfr.score(X_test, y_test))

RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)
n=5: 0.41628751574305733
n=6: 0.41375435938191973
n=7: 0.4010797862272353
n=8: 0.39698696410961576
n=9: 0.3632621326227645
n=10: 0.3670918718213345
n=11: 0.4081281378564558
n=12: 0.42475404445421117
n=13: 0.429062736625921
n=14: 0.43896932140007705
n=15: 0.4278353541559378
n=16: 0.4302538340596409
n=17: 0.4526748803289141
n=18: 0.43179943064100473
n=19: 0.43391578253562013
n=20: 0.43306035285362554
n=21: 0.4352656651618335
n=22: 0.42435659235459655
n=23: 0.41762493402753775
n=24: 0.4069044867909759
n=25: 0.4186833831710845
n=26: 0.4036941382168461
n=27: 0.40627414200489953
n=28: 0.3997464291143709
n=29: 0.3950083091136599


In [25]:
# from the looks of it, the best n value = 17 where the score is at least 0.45.
# This is better than linear regression where best score is 0.35 (Ridge regression without including job details)

In [26]:
# Try an extreme n value of 100 and see what is the score
rfr = RandomForestRegressor(n_estimators = 100, random_state = 10,n_jobs=-1)
rfr.fit(X_train,y_train)
print('n=100:'.format(n),rfr.score(X_test, y_test))

n=100: 0.42713877668701883


In [27]:
# from the above run, larger n did not really make huge difference in terms of accuracy score
# futhermore, the model took more time to run.

In [28]:
# Using the best performing n = 17, run the model again to get the performance score

rfr = RandomForestRegressor(n_estimators = 17, random_state = 10)

rfr.fit(X_train, y_train)

print('RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)')

print ('Cross_val_predict  :', cross_val_predict(rfr, X_test, y_test, cv=5, n_jobs=-1).mean())
print ('Accuracy Score     : ', rfr.score(X_test, y_test))
print('RMSE                :', np.sqrt(metrics.mean_squared_error(y_test,cross_val_predict(rfr, X_test, y_test, cv=5, n_jobs=-1) )))

RANDOM FOREST REGRESSOR FOR DF (DATASET WITHOUT JOB_DETAILS)
Cross_val_predict  : 6665.126087860049
Accuracy Score     :  0.4526748803289141
RMSE                : 6501.098815114811


#### Modelling of data set with job details

In [29]:
rfr = RandomForestRegressor(n_estimators = 10, random_state = 10)

rfr.fit(X1_train, y1_train)

print('RANDOM FOREST REGRESSOR FOR DF1 (DATASET WITH JOB_DETAILS)')

print ('Cross_val_predict  :', cross_val_predict(rfr, X1_test, y_test, cv=5, n_jobs=-1).mean())
print ('Accuracy Score     : ', rfr.score(X1_test, y1_test))
print('RMSE                :', np.sqrt(metrics.mean_squared_error(y1_test,cross_val_predict(rfr, X1_test, y1_test, cv=5, n_jobs=-1) )))

RANDOM FOREST REGRESSOR FOR DF1 (DATASET WITH JOB_DETAILS)
Cross_val_predict  : 6357.303213680499
Accuracy Score     :  0.3472793023592502
RMSE                : 7271.3722067440585


In [30]:
# try to loop through a range of trees or n values from 5 to 30 and look at the score trend
print('RANDOM FOREST REGRESSOR FOR DF1 (DATASET WITH JOB_DETAILS)')

for n in range(5,30):
    rfr = RandomForestRegressor(n_estimators = n, random_state = 10,n_jobs=-1)
    rfr.fit(X1_train,y1_train)
    print('n={}:'.format(n),rfr.score(X1_test, y1_test))

RANDOM FOREST REGRESSOR FOR DF1 (DATASET WITH JOB_DETAILS)
n=5: 0.42521500657962397
n=6: 0.39520165510544125
n=7: 0.38311473178402544
n=8: 0.3695777047978892
n=9: 0.3271886341977017
n=10: 0.3472793023592502
n=11: 0.36275888259889455
n=12: 0.3849624958094882
n=13: 0.3956107328842624
n=14: 0.40431143487679044
n=15: 0.3854319942437521
n=16: 0.3889684341922074
n=17: 0.4155049854346224
n=18: 0.3885409824557181
n=19: 0.39423148729979496
n=20: 0.3924605087415937
n=21: 0.3988617069284267
n=22: 0.3910563371923119
n=23: 0.38934960635422544
n=24: 0.3796504962797169
n=25: 0.3957527207354268
n=26: 0.3749617073666208
n=27: 0.3745133005949025
n=28: 0.3683560570778916
n=29: 0.3624289942429567


In [31]:
# Try an extreme n value of 100 and see what is the score
rfr = RandomForestRegressor(n_estimators = 100, random_state = 10)
rfr.fit(X1_train,y1_train)
print('n=100:'.format(n),rfr.score(X1_test, y1_test))

n=100: 0.4058461290665194


In [32]:
# Similar to LinearRegression, for random forest regressor, the model perform poorer 
# when job_details are included. 
# However, the performance drop is not as bad. 


### Conclusion: for Qn 1, random forest regressor works better and performance is more consistent as compared to Linear regression models.                                                 In terms of the factors, the salary is better predicted using employment type and job title

<div class="alert alert-success">

## Qn2: Factors that distinguish job category
- Add job_title column to both df and df1
- Create job_df and job_df1 (df is with job details, df1 is without job details)
- Create dummy variable for job_title that contains the word 'data' 
- For roles that have data, value = 1, those without value = 0
- Thus, it will become a binary classification dataset
- split the dataframe into train and test set using train_test_split
- Train on the 2 models
- Model tried: Logistic Regression and Decision Tree classifier

In [33]:
emp_dummies['job_title']

0                                          Senior Manager
1                                          Senior Manager
2                                        Campaign Manager
3       Account Cum Admin  /  /  Senior Level  /  /  2...
4       Senior Site Engineers  /  /  Electrical  /  / ...
5       Vehicle Technicians  /  /  Electrician  /  /  ...
6       Warehouse Assistant  /  /  Lavender  /  /  150...
7                                      Strategy Consultat
8                                      Associate Director
9       Senior Network Manager (Projects, ITSM, ITIL, ...
10      Network Architect (Projects, Operations) - per...
11      Test Engineer [test script /  information tech...
12                           Assistant Engineer (Plating)
13                                 Pre-Approval Evaluator
14                                   Assistant HR Manager
15                    Infrastructure Technical Specialist
16      Lecturer - Info-Comm Technology (IT Applicatio...
17      Lectur

In [34]:
# Concat the job_title to the df and df1 and drop the 'average_salary' column

In [35]:
# df_job is Dataframe with job_details
df_job = pd.concat([emp_dummies['job_title'],df],axis=1)

# df_job1 is Dataframe without job_details
df_job1 = pd.concat([emp_dummies['job_title'],df1],axis=1)

In [36]:
df_job.shape

(3312, 898)

In [37]:
df_job1.shape

(3312, 798)

In [38]:
# Since the predicted values are the job_cat, we can drop the average salary column

In [39]:
df_job.drop(columns='average_salary',inplace=True)

In [40]:
df_job1.drop(columns='average_salary',inplace=True)

In [41]:
df_job.shape

(3312, 897)

In [42]:
df_job1.shape

(3312, 797)

In [43]:
# Filter out the job title that contains the word 'data' in it
# Out of 3312 rows, there are only 323 rows with data in it
df_job[df_job.job_title.str.contains(pat='Data')]

Unnamed: 0,job_title,"Contract, Full Time","Contract, Full Time, Flexi work","Contract, Full Time, Internship","Contract, Internship",Flexi work,"Freelance, Full Time, Flexi work",Full Time,"Full Time, Flexi work","Full Time, Internship",...,technic,technolog,test,time,tool,understand,use,user,work,year
35,Data Modeler,0,0,0,0,0,0,0,0,0,...,0.000000,0.044093,0.000000,0.000000,0.000000,0.021338,0.000000,0.028011,0.063152,0.015216
37,Senior ETL and Data Engineer,0,0,0,0,0,0,1,0,0,...,0.358832,0.115785,0.000000,0.000000,0.000000,0.056032,0.210797,0.000000,0.331662,0.039957
38,Senior ETL and Data Engineer,0,0,0,0,0,0,1,0,0,...,0.359936,0.116142,0.000000,0.000000,0.000000,0.056205,0.211446,0.000000,0.332682,0.040080
39,Data Scientist,0,0,0,0,0,0,0,0,0,...,0.062629,0.060626,0.000000,0.125150,0.132756,0.058678,0.055187,0.000000,0.138928,0.000000
40,Big Data Engineer (Financial Services),0,0,0,0,0,0,1,0,0,...,0.046162,0.089371,0.000000,0.092244,0.195701,0.216248,0.162708,0.000000,0.153600,0.030842
41,Big Data Engineer (Financial Services),0,0,0,0,0,0,1,0,0,...,0.046162,0.089371,0.000000,0.092244,0.195701,0.216248,0.162708,0.000000,0.153600,0.030842
42,Data Scientists,0,0,0,0,0,0,1,0,0,...,0.000000,0.000000,0.000000,0.000000,0.143917,0.000000,0.358964,0.000000,0.000000,0.090723
43,Data Scientist,0,0,0,0,0,0,0,0,0,...,0.000000,0.280858,0.000000,0.000000,0.000000,0.203875,0.127832,0.000000,0.040225,0.193846
44,Data Scientist,0,0,0,0,0,0,1,0,0,...,0.000000,0.121399,0.000000,0.000000,0.000000,0.000000,0.110509,0.000000,0.278194,0.000000
45,Data Scientist,0,0,0,0,0,0,1,0,0,...,0.000000,0.121399,0.000000,0.000000,0.000000,0.000000,0.110509,0.000000,0.278194,0.000000


In [44]:
# Put in 1 on data_role if the job_title contains the word 'data'
df_job['data_role'] = df_job.job_title.str.contains(pat='Data').astype(int)

In [45]:
# Check that integer value is put in the right row
df_job[df_job.job_title.str.contains(pat='Data')]['data_role']

35      1
37      1
38      1
39      1
40      1
41      1
42      1
43      1
44      1
45      1
46      1
47      1
48      1
50      1
108     1
134     1
178     1
215     1
216     1
217     1
219     1
220     1
221     1
222     1
223     1
224     1
236     1
339     1
360     1
361     1
       ..
2806    1
2880    1
2960    1
2961    1
2962    1
2963    1
2964    1
2965    1
2966    1
2967    1
2968    1
2972    1
3044    1
3091    1
3100    1
3111    1
3112    1
3113    1
3114    1
3115    1
3209    1
3210    1
3211    1
3212    1
3213    1
3214    1
3215    1
3258    1
3294    1
3295    1
Name: data_role, Length: 323, dtype: int64

In [47]:
 # Split the dataframe into X and y

X = df_job.drop(columns=['data_role','job_title'])
y = df_job['data_role']

In [48]:
df_job.shape

(3312, 898)

In [49]:
print (X.shape)
print (y.shape)

(3312, 896)
(3312,)


In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)

print ('X_train:',X_train.shape)
print ('X_test:',X_test.shape)
print ('y_train:',y_train.shape)
print ('y_test:',y_test.shape)

X_train: (1987, 896)
X_test: (1325, 896)
y_train: (1987,)
y_test: (1325,)


In [51]:
# Fitting the different classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc


logr = LogisticRegression(random_state = 0, n_jobs=1)
logr_lasso = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=100, cv=10, n_jobs=1)
logr_ridge = LogisticRegressionCV(penalty='l2', Cs=200, cv=5, n_jobs=1)
dtc = DecisionTreeClassifier(max_depth=None)


logr.fit(X_train, y_train)
logr_lasso.fit(X_train, y_train)
logr_ridge.fit(X_train, y_train)
dtc.fit(X_train, y_train)


print ('Classification Report (precision/recall/f1-score/support)')
print ('LogisticRegression             : ' , classification_report(y_test, logr.predict(X_test)))
print ('LogisticRegressionCV L1 Lasso  : ' , classification_report(y_test, logr_lasso.predict(X_test)))
print ('LogisticRegressionCV L2 Ridge  : ' , classification_report(y_test, logr_ridge.predict(X_test)))
print ('DecisionTreeClassifier         : ' , classification_report(y_test, dtc.predict(X_test)))




Classification Report (precision/recall/f1-score/support)
LogisticRegression             :                precision    recall  f1-score   support

           0       0.97      1.00      0.98      1180
           1       0.97      0.78      0.86       145

   micro avg       0.97      0.97      0.97      1325
   macro avg       0.97      0.89      0.92      1325
weighted avg       0.97      0.97      0.97      1325

LogisticRegressionCV L1 Lasso  :                precision    recall  f1-score   support

           0       0.99      1.00      0.99      1180
           1       0.96      0.94      0.95       145

   micro avg       0.99      0.99      0.99      1325
   macro avg       0.98      0.97      0.97      1325
weighted avg       0.99      0.99      0.99      1325

LogisticRegressionCV L2 Ridge  :                precision    recall  f1-score   support

           0       0.99      1.00      0.99      1180
           1       0.96      0.92      0.94       145

   micro avg       0.9



### Executive Summary
The 2 goals of this project are:
1. Predict salary value from the various job details
2. Predict job category from the various job details

The approach is:
1. Data is scrap from myfuturecarrer.gov.sg. On the website the keyword that was searched is 'data'. In total, there were about 4000 job posting with the keyword 'data'
2. The URL was first scrap from the website and subsequently use to scrap the details of each posting.
3. Exploratory data analysis (EDA) was carried out which includes cleaning, feature engineering of columns, dropping null values and visualisation. 
4. Natural language processing (NLP) techniques were used to tokenize to create variables for text string data such as job description. NLP can help to determine keywords inside the job description.
5. The next steps are to use modelling techniques to fit the cleaned data and assessed the model's performance.

Modelling:
1. For question 1, the prediction is to predict a salary value. In this case, regression models were used. In this project, linear regression  were used together with ridge, lasso and elastic net regularization technique. 
2. For question 2, the prediction is to predict job categories with data roles. In this case, classification models were used. The models used were logistic regression, CV and decision tree classifiers.

Conclusion:
1. For question 1, the finding was that liner regression model do not work well as the variables and predicted values do not have a linear relationship. However, random forest regressor works better. Another finding was that the model works better without including job description.

2. For question 2, much more work needs to be done in terms of handling class imbalances. The jobs with 'data' in them took up only 10% of the entire dataset (300+ out of 3000+). Simple fitting of the model will not get good results. 