# Salary Predictions Based on Job Descriptions

# Part 1 - DEFINE

### ---- 1 Define the problem ----

The goal of this project is to predict the salary of a job from its description.

Job descriptions have eight __features__:
* __jobId__ = a unique index for each job
* __companyId__ = a categorical ID for the company the job is for
* __jobType__ = a categorical feature describing the role
* __degree__ = a categorical feature describing the required education level
* __major__ = a categorical feature conveying the field in which a degree is required, if any
* __industry__ = a categorical feature describing the industry to which the job belongs
* __yearsExperience__ = a numerical feature measuring how many years of work experience are required for the role
* __milesFromMetropolis__ = a numerical feature measuring how far the workplace is located from a metropolis

The __target__ is __salary__. Salaries are given in the training set and need to be predicted for the test set.
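The column roles listed above can be written down as plain Python constants for later steps to reuse. This is only a minimal sketch: the constant names (`ID_COLUMN`, `CATEGORICAL_FEATURES`, `NUMERICAL_FEATURES`, `TARGET`, `ALL_FEATURES`) are illustrative and are not used elsewhere in this notebook.

```python
# Hypothetical constants grouping the columns by role (names are illustrative).
ID_COLUMN = "jobId"
CATEGORICAL_FEATURES = ["companyId", "jobType", "degree", "major", "industry"]
NUMERICAL_FEATURES = ["yearsExperience", "milesFromMetropolis"]
TARGET = "salary"

# All eight columns of a job description, in the order listed above.
ALL_FEATURES = [ID_COLUMN] + CATEGORICAL_FEATURES + NUMERICAL_FEATURES
print(len(ALL_FEATURES))
```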

In [1]:
# Import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Make matplotlib visualisations inline
%matplotlib inline

# Author information
__author__ = "Paawan Sharma"
__email__ = "paawansharma@protonmail.com"

# Part 2 - DISCOVER

### ---- 2 Load the data ----

In [2]:
# Load the training and test data into pandas DataFrames

train_data = pd.read_csv('data/train_features.csv', header=0)
# Salaries are attached by row position, which assumes both files list jobs in the same order
train_data['salary'] = pd.read_csv('data/train_salaries.csv', header=0)['salary']

test_data = pd.read_csv('data/test_features.csv', header=0)
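
A join on `jobId` is one way to attach salaries without relying on row order. The sketch below uses small made-up frames in place of the CSV files, and it assumes the salaries file also carries a `jobId` column, which this notebook does not show. With `validate="one_to_one"`, pandas raises an error if a `jobId` repeats on either side.

```python
import pandas as pd

# Made-up stand-ins for the features and salaries files (illustrative values only).
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "yearsExperience": [10, 3, 8],
})
salaries = pd.DataFrame({
    "jobId": ["JOB3", "JOB1", "JOB2"],  # deliberately in a different order
    "salary": [142, 130, 101],
})

# Join on the key; validate raises MergeError if jobId is duplicated on either side.
merged = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
print(merged)
```

If the two files are already aligned row by row, this produces the same frame as the positional assignment.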

#### Take an initial look at the data.

In [3]:
display(train_data)
train_data.info()

display(test_data)
test_data.info()

|  | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB1362684407687 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 | 130 |
| 1 | JOB1362684407688 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 | 101 |
| 2 | JOB1362684407689 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 | 137 |
| 3 | JOB1362684407690 | COMP38 | MANAGER | DOCTORAL | CHEMISTRY | AUTO | 8 | 17 | 142 |
| 4 | JOB1362684407691 | COMP7 | VICE_PRESIDENT | BACHELORS | PHYSICS | FINANCE | 8 | 16 | 163 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | JOB1362685407682 | COMP56 | VICE_PRESIDENT | BACHELORS | CHEMISTRY | HEALTH | 19 | 94 | 88 |
| 999996 | JOB1362685407683 | COMP24 | CTO | HIGH_SCHOOL | NONE | FINANCE | 12 | 35 | 160 |
| 999997 | JOB1362685407684 | COMP23 | JUNIOR | HIGH_SCHOOL | NONE | EDUCATION | 16 | 81 | 64 |
| 999998 | JOB1362685407685 | COMP3 | CFO | MASTERS | NONE | HEALTH | 6 | 5 | 149 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   jobId                1000000 non-null  object
 1   companyId            1000000 non-null  object
 2   jobType              1000000 non-null  object
 3   degree               1000000 non-null  object
 4   major                1000000 non-null  object
 5   industry             1000000 non-null  object
 6   yearsExperience      1000000 non-null  int64 
 7   milesFromMetropolis  1000000 non-null  int64 
 8   salary               1000000 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 68.7+ MB


|  | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|---|---|
| 0 | JOB1362685407687 | COMP33 | MANAGER | HIGH_SCHOOL | NONE | HEALTH | 22 | 73 |
| 1 | JOB1362685407688 | COMP13 | JUNIOR | NONE | NONE | AUTO | 20 | 47 |
| 2 | JOB1362685407689 | COMP10 | CTO | MASTERS | BIOLOGY | HEALTH | 17 | 9 |
| 3 | JOB1362685407690 | COMP21 | MANAGER | HIGH_SCHOOL | NONE | OIL | 14 | 96 |
| 4 | JOB1362685407691 | COMP36 | JUNIOR | DOCTORAL | BIOLOGY | OIL | 10 | 44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | JOB1362686407682 | COMP54 | VICE_PRESIDENT | BACHELORS | MATH | OIL | 14 | 3 |
| 999996 | JOB1362686407683 | COMP5 | MANAGER | NONE | NONE | HEALTH | 20 | 67 |
| 999997 | JOB1362686407684 | COMP61 | JANITOR | NONE | NONE | OIL | 1 | 91 |
| 999998 | JOB1362686407685 | COMP19 | CTO | DOCTORAL | MATH | OIL | 14 | 63 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   jobId                1000000 non-null  object
 1   companyId            1000000 non-null  object
 2   jobType              1000000 non-null  object
 3   degree               1000000 non-null  object
 4   major                1000000 non-null  object
 5   industry             1000000 non-null  object
 6   yearsExperience      1000000 non-null  int64 
 7   milesFromMetropolis  1000000 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 61.0+ MB


### ---- 3 Clean the data ----

In [12]:
# Look for duplicate data in the training set. Duplicate job IDs may indicate corrupt data.

print("There are {} duplicate rows in the training set.".format(train_data.duplicated().sum()))
print("There are {} duplicate jobIDs in the training set.".format(train_data['jobId'].duplicated().sum()))

There are 0 duplicate rows in the training set.
There are 0 duplicate jobIDs in the training set.
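
Because every `jobId` is unique, `duplicated()` on the full frame cannot flag two postings that are identical apart from their ID. One possible extra check, sketched here on a small made-up frame, compares rows on all columns except `jobId`. Such matches are not necessarily errors, since distinct jobs can share the same attributes.

```python
import pandas as pd

# Made-up rows (illustrative only): JOB1 and JOB3 differ only in jobId.
df = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CFO", "CEO", "CFO"],
    "yearsExperience": [10, 3, 10],
    "salary": [130, 101, 130],
})

# Rows identical in every column, including the ID.
full_row_duplicates = df.duplicated().sum()

# Rows identical in every column except the ID.
content_columns = [c for c in df.columns if c != "jobId"]
content_duplicates = df.duplicated(subset=content_columns).sum()

print(full_row_duplicates, content_duplicates)
```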


In [55]:
# Look for invalid data.

# Numerical features in both dataframes

for df_name, df in {"test set": test_data, "training set": train_data}.items():
    print("Checking {}".format(df_name))
    print("Are years of experience non-negative?")
    print(df[df['yearsExperience'] < 0].empty)
    print("Are miles from metropolis non-negative?")
    print(df[df['milesFromMetropolis'] < 0].empty)
    print("\n")
    
print("Are salary values in training set non-negative?")
print(train_data[train_data['salary'] < 0].empty)

Checking test set
Are years of experience non-negative?
True
Are miles from metropolis non-negative?
True


Checking training set
Are years of experience non-negative?
True
Are miles from metropolis non-negative?
True


Are salary values in training set non-negative?
True
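
The checks above report only `True` or `False`. Counting the offending rows per column gives more to go on if a check ever fails. A minimal sketch on made-up data, with one invalid value planted on purpose:

```python
import pandas as pd

# Made-up frame with one deliberately invalid value (illustrative only).
df = pd.DataFrame({
    "yearsExperience": [10, 3, -1],
    "milesFromMetropolis": [83, 73, 38],
})

numerical_columns = ["yearsExperience", "milesFromMetropolis"]

# Boolean mask of negative cells, summed per column to get a count.
negative_counts = (df[numerical_columns] < 0).sum()
print(negative_counts)
```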


In [58]:
# Categorical features in both dataframes

for df_name, df in {"training set": train_data, "test set": test_data}.items():
    print("Checking {}".format(df_name))
    for feature in ['jobType', 'degree', 'major', 'industry']:
        print("Values for {} are:".format(feature))
        print(list(df[feature].unique()))
        print("\n")
    print("\n")

Checking training set
Values for jobType are:
['CFO', 'CEO', 'VICE_PRESIDENT', 'MANAGER', 'JUNIOR', 'JANITOR', 'CTO', 'SENIOR']


Values for degree are:
['MASTERS', 'HIGH_SCHOOL', 'DOCTORAL', 'BACHELORS', 'NONE']


Values for major are:
['MATH', 'NONE', 'PHYSICS', 'CHEMISTRY', 'COMPSCI', 'BIOLOGY', 'LITERATURE', 'BUSINESS', 'ENGINEERING']


Values for industry are:
['HEALTH', 'WEB', 'AUTO', 'FINANCE', 'EDUCATION', 'OIL', 'SERVICE']




Checking test set
Values for jobType are:
['MANAGER', 'JUNIOR', 'CTO', 'SENIOR', 'CEO', 'VICE_PRESIDENT', 'JANITOR', 'CFO']


Values for degree are:
['HIGH_SCHOOL', 'NONE', 'MASTERS', 'DOCTORAL', 'BACHELORS']


Values for major are:
['NONE', 'BIOLOGY', 'COMPSCI', 'PHYSICS', 'LITERATURE', 'MATH', 'CHEMISTRY', 'ENGINEERING', 'BUSINESS']


Values for industry are:
['HEALTH', 'AUTO', 'OIL', 'FINANCE', 'SERVICE', 'EDUCATION', 'WEB']
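
Comparing the two printed lists by eye is error-prone because the values appear in a different order. A set difference makes the comparison explicit. The sketch below does this for `jobType`, using the values copied from the output above; the variable names are illustrative.

```python
# Unique jobType values copied from the printed output above.
train_job_types = ['CFO', 'CEO', 'VICE_PRESIDENT', 'MANAGER', 'JUNIOR', 'JANITOR', 'CTO', 'SENIOR']
test_job_types = ['MANAGER', 'JUNIOR', 'CTO', 'SENIOR', 'CEO', 'VICE_PRESIDENT', 'JANITOR', 'CFO']

# Categories present in one set but missing from the other.
only_in_train = set(train_job_types) - set(test_job_types)
only_in_test = set(test_job_types) - set(train_job_types)

print(only_in_train, only_in_test)
```

Both differences being empty means the training and test sets share the same `jobType` categories. The same comparison could be repeated for `degree`, `major`, and `industry`.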




