# Findings for Scraping Job Postings

## Introduction

#### Source
The job postings were scraped from MyCareersFuture, a Singapore job portal that lets jobseekers search for jobs matching their skills and competencies.

#### Dataset
The job portal provides the following information:
- Job Title
- Company Name and Address
- Job Description and Requirements
- Employment Type (Full Time, Contract, Permanent, Temporary, etc.)
- Job Seniority (Entry-Level, Executive, Management, etc.)
- Job Category (IT, Advertising, Banking, etc.)
- Salary Range (Min, Max)
- Skills Required

A total of 1093 job postings were obtained from the portal.
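As an illustration of how one scraped posting could be held in memory, here is a minimal sketch of a record type covering the fields listed above. The class name, field names and the `salary_midpoint` helper are hypothetical and are not taken from the scraper itself; using the midpoint of the salary range as a single salary value is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobPosting:
    """One scraped posting (hypothetical field names)."""
    title: str
    company: str
    address: str
    description: str
    employment_types: List[str] = field(default_factory=list)
    seniorities: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None
    skills: List[str] = field(default_factory=list)

    def salary_midpoint(self) -> Optional[float]:
        # Assumption: a single salary figure is the midpoint of the range.
        if self.salary_min is None or self.salary_max is None:
            return None
        return (self.salary_min + self.salary_max) / 2
```

The list-valued fields reflect that a posting can carry several employment types, seniority types, categories and skills.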

## Section 1: Job Salary Trends

#### Factors that impact Salary

A total of 5 approaches were used to predict the salary:
1. Classification (LogisticRegression) using Employment Type, Seniority and Category
2. Classification (LogisticRegression) for Low/Medium/High salary groups using Employment Type, Seniority and Category
3. Regression (LinearRegression) for Low/Medium/High salary groups using Employment Type, Seniority and Category
4. Classification (CountVectorizer + MultinomialNB) using Job Requirements
5. Classification (CountVectorizer + LogisticRegression) using Job Requirements
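The cut-offs used for the Low/Medium/High salary groups are not stated here. One plausible way to form them, shown purely as a sketch, is to split the salaries at their terciles so that each group holds roughly a third of the postings. The function name `salary_groups` is hypothetical.

```python
from statistics import quantiles


def salary_groups(salaries):
    """Label each salary Low/Medium/High using tercile cut points.

    Assumption: groups are equal-sized terciles; the original cut-offs
    may have been chosen differently.
    """
    low_cut, high_cut = quantiles(salaries, n=3)
    labels = []
    for salary in salaries:
        if salary <= low_cut:
            labels.append("Low")
        elif salary <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels
```

Tercile cut points keep the three classes balanced, which makes the accuracy of a classifier easier to compare against a chance baseline.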

The Approach 1 score of 0.112957 was used as the baseline, since Approach 1 is the simplest model (it classifies the salary directly from the 3 factors).

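Approach 3 provided the best score of 0.604. Because Approach 3 is a regression model, this score is presumably the coefficient of determination (R², the value returned by scikit-learn's `LinearRegression.score`) rather than a classification accuracy, so it is not directly comparable with the accuracy scores of the classifiers. Its prediction errors are summarised below: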
- Mean Error: 394.45
- Max Error: 4714.61
- Min Error: -28673.52
- Standard Deviation: 3097.94
- Score: 0.604061
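Error statistics like the ones above can be computed from the actual and predicted salaries as in the following sketch. The sign convention (predicted minus actual) and the use of the population standard deviation are assumptions; the function name `error_summary` is hypothetical.

```python
from statistics import mean, pstdev


def error_summary(actual, predicted):
    """Summarise prediction errors, assuming error = predicted - actual."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "mean": mean(errors),
        "max": max(errors),
        "min": min(errors),
        "std": pstdev(errors),  # population standard deviation (assumption)
    }
```

Under this convention, a large negative minimum error means that the model underpredicted at least one very high salary.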

Based on the 45 coefficients of the linear regression model (covering employment type, job seniority and job category), the top 10 factors that affect the salary, in order of importance, are:
- Seniority: Senior Management (5.1%)
- Seniority: Middle Management (4.5%)
- Category: Banking and Finance (3.5%)
- Seniority: Manager (2.8%)
- Category: Consulting (2.1%)
- Seniority: Senior Executive (1.5%)
- Category: Engineering (1.5%)
- Category: Design (1.2%)
- Seniority: Professional (1.1%)
- Seniority: Non-Executive (1.1%)
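How the percentages above were derived from the coefficients is not stated. One common convention, sketched here as an assumption, is to express each coefficient's absolute value as a share of the sum of all absolute coefficients. The function name `relative_importance` and the example feature names are hypothetical.

```python
def relative_importance(coefficients):
    """Rank features by |coefficient| as a percentage of the total.

    Assumption: importance is the normalised absolute coefficient;
    the report may have used a different definition.
    """
    total = sum(abs(c) for c in coefficients.values())
    if total == 0:
        return [(name, 0.0) for name in coefficients]
    shares = {name: 100 * abs(c) / total for name, c in coefficients.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

For example, `relative_importance({"A": 3.0, "B": -1.0})` ranks `A` first with a 75% share and `B` second with 25%. Note that absolute values hide the direction of the effect, so a high share can belong to a factor that lowers the salary.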

## Section 2: Job Category Factors

#### Components of Job Posting that distinguish data scientists from other data jobs

The skills required in each job posting were used to analyse whether the skill set of a data scientist differs from that of other data jobs. The reasoning is that if a model can predict from the skill set alone that a job is a data scientist role, then the skill set distinguishes data scientists from other data jobs. There are a total of 622 distinct skills.
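To feed skills into a classifier, each posting's skill list has to become a fixed-length numeric vector. A minimal sketch of one way to do this, assuming a simple multi-hot encoding (one 0/1 column per distinct skill), is shown below; the function names are hypothetical.

```python
def build_vocabulary(skill_lists):
    """Collect every distinct skill, sorted for a stable column order."""
    return sorted({skill for skills in skill_lists for skill in skills})


def multi_hot(skills, vocabulary):
    """Return a 0/1 vector marking which vocabulary skills are present."""
    present = set(skills)
    return [1 if skill in present else 0 for skill in vocabulary]
```

With 622 distinct skills, each posting would become a vector of 622 zeros and ones under this encoding.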

To avoid an imbalanced dataset, 500 samples of data scientist jobs and 500 samples of non-data-scientist jobs were used. With two equally sized classes, the baseline accuracy is 0.5.
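The balancing step could look like the sketch below, which draws the same number of postings from each class. Sampling without replacement and the fixed seed are assumptions, and `balanced_sample` and `is_positive` are hypothetical names.

```python
import random


def balanced_sample(postings, is_positive, n_per_class, seed=0):
    """Draw n_per_class items from each class, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [p for p in postings if is_positive(p)]
    negatives = [p for p in postings if not is_positive(p)]
    if len(positives) < n_per_class or len(negatives) < n_per_class:
        raise ValueError("not enough postings in one of the classes")
    return rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)
```

If one class has fewer postings than requested, sampling with replacement (for example with `random.choices`) would be an alternative, at the cost of duplicated rows.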

A Logistic Regression (classification) model using the skill set as input achieves an accuracy score of 0.84 (i.e. 84 out of 100 predictions are correct). According to the model, the top 10 skills associated with data scientist postings (along with their importance) are:
- MPI (3.9%)
- CRM (2.1%)
- SPSS (2%)
- SSIS (1.9%)
- Strategy (1.7%)
- Telecommunications (1.5%)
- HTML (1.4%)
- GCP (1.4%)
- EDC (1.4%)
- Valuation (1.3%)

#### Features important for distinguishing junior vs senior positions

There are a total of 9 seniority types listed in the job portal. Because some jobs were listed under more than 1 seniority type, these jobs were removed to provide a more accurate model.

The 9 seniority types were then classified into 2 groups:
1. Junior: Non Executive, Junior Executive, Executive, Senior Executive, Fresh/Entry Level, Professional
2. Senior: Manager, Middle Management, Senior Management
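The grouping above, together with the removal of postings that list more than one seniority type, can be sketched as follows. The function name `seniority_group` is hypothetical, and the seniority strings are assumed to match the portal's labels exactly.

```python
JUNIOR = {
    "Non Executive", "Junior Executive", "Executive",
    "Senior Executive", "Fresh/Entry Level", "Professional",
}
SENIOR = {"Manager", "Middle Management", "Senior Management"}


def seniority_group(seniorities):
    """Map a posting's seniority list to 'Junior' or 'Senior'.

    Returns None for postings with zero or several seniority types,
    or with an unrecognised type, so that they can be dropped.
    """
    if len(seniorities) != 1:
        return None
    level = seniorities[0]
    if level in JUNIOR:
        return "Junior"
    if level in SENIOR:
        return "Senior"
    return None
```

Postings for which the function returns `None` would be filtered out before training.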

With this junior/senior position classification, the baseline accuracy is 0.7.
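A baseline of this kind is usually the accuracy obtained by always predicting the most frequent class. Assuming that definition applies here, it can be computed as in this sketch (the function name `majority_baseline` is hypothetical):

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)
```

Under this definition, a baseline of 0.7 would mean that about 70% of the remaining postings fall into the larger of the two groups, and any model has to beat that figure to be useful.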

A total of 2 approaches were used to predict whether a position is junior or senior:
- Classification (LogisticRegression) using Skill set
- Classification (CountVectorizer + MultinomialNB) using Job Titles

Approach 2 (i.e. classification using Job Titles) provides the better accuracy score of 0.846. This means that roughly 85 out of 100 predictions would be correct.

The classification model has determined that the following 10 keywords within the job title (in order of importance) would indicate a senior position:
- hadoop
- principal
- specialist
- manager
- processing
- lead
- procurement
- business
- scientist
- intelligence
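To make the job-title approach concrete, the sketch below implements a tiny bag-of-words multinomial Naive Bayes classifier in pure Python. It is a simplified stand-in for the CountVectorizer + MultinomialNB combination named above, not the scikit-learn implementation itself: tokenisation is a plain lowercase split, and the function names and the smoothing value `alpha=1.0` are assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    # Simplified tokeniser: lowercase and split on whitespace.
    return title.lower().split()


def train_nb(titles, labels, alpha=1.0):
    """Fit word log-likelihoods per class with additive smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        word_counts[label].update(tokenize(title))
    vocab = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / total)
                for word in vocab
            },
        }
    return model


def predict_nb(model, title):
    """Return the class with the highest log posterior score."""
    def score(label):
        params = model[label]
        # Words never seen in training are ignored (contribute 0).
        return params["log_prior"] + sum(
            params["log_likelihood"].get(word, 0.0) for word in tokenize(title)
        )
    return max(model, key=score)
```

In a model like this, keywords can be ranked by how much larger their log-likelihood is for the senior class than for the junior class, which is one way a list such as the one above could be produced.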

#### Whether requirements for the same title vary significantly with industry (e.g. healthcare vs government)

Job postings with 'Manager' in the job title were used to determine whether job requirements in healthcare differ from job requirements in government. The reasoning is that if a model can predict from the required skills that a posting is a government job, then the requirements differ between the two industries.

To avoid an imbalanced dataset, 500 samples of manager jobs in the healthcare industry and 500 samples of manager jobs in government were used. With two equally sized classes, the baseline accuracy is 0.5.

Using a classification model (LogisticRegression) with the skill set as input, the model achieves an accuracy score of 0.897 (i.e. roughly 90 out of 100 predictions are correct).

The classification model has determined that the following 10 skills (in order of importance) would indicate a government position:
- Budget
- Construction
- SQL
- Databases
- EDC
- Research
- GCP
- Access
- HTML
- Visio