# Getting Started

In [1]:
# Author information
__author__ = "Troy Reynolds"
__email__ = "Troy.Lloyd.Reynolds@gmail.com"

In [2]:
# libraries
import pandas as pd
import sys

# Add the function_scripts directory to the import path so the project's helper functions can be imported
sys.path.insert(0, "./function_scripts")
from data_import_functions import read_data, get_data

## Understanding the Problem

The goal of this model is to accurately predict the salary of a given job posting based on certain features.

#### Purpose:
This model can help job seekers determine whether a listing offers a reasonable salary by comparing it with other jobs that have similar requirements and characteristics. It also gives applicants leverage when negotiating salaries if they decide to apply to listings whose salaries seem unreasonable.

#### Error Metric:
We use the [Mean Squared Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) to measure each model's prediction error and to select the best model (lower is better). MSE is chosen over other regression error metrics, such as the [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error), because it squares each error, so predictions far from the target value are penalized disproportionately more than predictions close to it.
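
To illustrate the difference between the two metrics, here is a minimal sketch using made-up numbers (the salary values and the helper names `mse` and `mae` are assumptions for illustration, not part of the project's code). Both sets of predictions have the same MAE, but the set containing one large miss has a much higher MSE.

```python
import numpy as np

# Hypothetical target salaries and two sets of predictions (illustrative values only)
y_true = np.array([100.0, 100.0, 100.0, 100.0])
pred_small_errors = np.array([110.0, 90.0, 110.0, 90.0])    # four errors of size 10
pred_one_big_miss = np.array([100.0, 100.0, 100.0, 140.0])  # a single error of size 40


def mse(y, y_hat):
    """Mean of the squared errors."""
    return float(np.mean((y - y_hat) ** 2))


def mae(y, y_hat):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(y - y_hat)))


print("MAE:", mae(y_true, pred_small_errors), mae(y_true, pred_one_big_miss))  # 10.0 10.0
print("MSE:", mse(y_true, pred_small_errors), mse(y_true, pred_one_big_miss))  # 100.0 400.0
```

MAE treats the two cases as equally good, while MSE ranks the predictions with the single large miss as four times worse.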

## Data:
The historical data is stored in three CSV files:
* train_salaries: Each row has an ID and its associated salary value.
* train_features: Each row contains the metadata for an individual job posting, along with its associated ID.
* test_features: Same format as train_features.

We can load the training data and view its characteristics. train_features and train_salaries will be joined on the column "jobId".
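
A join on "jobId" can be sketched with `pandas.DataFrame.merge`. This is only an illustration on toy data: the rows below are hypothetical, and the project's `get_data` helper may perform the join differently.

```python
import pandas as pd

# Toy stand-ins for train_features and train_salaries (hypothetical rows)
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CFO", "CEO", "MANAGER"],
    "yearsExperience": [10, 3, 8],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 101, 137],
})

# Inner join on the shared key; validate raises an error if jobId
# is not unique in either table, which guards against duplicated rows
merged = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
print(merged.shape)  # (3, 4)
```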

In [3]:
# Dataset characteristics
read_data("train", verbose = True)


*********************** Reading in the features dataset ************************

it has 1000000 rows and 8 columns

************************* It has the following columns *************************

jobId                  object
companyId              object
jobType                object
degree                 object
major                  object
industry               object
yearsExperience         int64
milesFromMetropolis     int64
dtype: object

*********************** The first 5 rows look like this ************************

              jobId companyId         jobType       degree      major  \
0  JOB1362684407687    COMP37             CFO      MASTERS       MATH   
1  JOB1362684407688    COMP19             CEO  HIGH_SCHOOL       NONE   
2  JOB1362684407689    COMP52  VICE_PRESIDENT     DOCTORAL    PHYSICS   
3  JOB1362684407690    COMP38         MANAGER     DOCTORAL  CHEMISTRY   
4  JOB1362684407691     COMP7  VICE_PRESIDENT    BACHELORS    PHYSICS   

  industry  yearsExperie

In [4]:
# load training data
data = get_data(dset = "train", 
                key = "jobId", 
                clean_details = True, 
                target_variable = "salary")


*************************** Data Cleanliness Report ****************************

Missing Values:
jobId                  0
companyId              0
jobType                0
degree                 0
major                  0
industry               0
yearsExperience        0
milesFromMetropolis    0
salary                 0
dtype: int64

Target Variable: salary
All values are positive.
There are 5 values equal to 0.

Duplicates:
There are no duplicates in the data.


In [5]:
# Load test data without report
test_features = get_data(dset = "test")

#### Report:
From the report, we see that there are 8 feature columns (6 categorical, including the jobId identifier, and 2 numerical) and 1,000,000 observations. Furthermore, there are no missing values or duplicates. However, there are 5 salary values that are equal to zero. Since there are 1 million observations, removing these 5 observations will not significantly affect the model.
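
One way to drop zero-salary rows is a boolean filter, sketched here on a toy DataFrame. The frame `toy` and its values are hypothetical; in the notebook, the same filter would presumably be applied to `data`, assuming `get_data` returns a pandas DataFrame with a "salary" column.

```python
import pandas as pd

# Toy frame with two zero-salary rows (hypothetical values)
toy = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3", "JOB4"],
    "salary": [120, 0, 95, 0],
})

# Keep only rows with a strictly positive salary and rebuild the index
cleaned = toy[toy["salary"] > 0].reset_index(drop=True)
n_removed = len(toy) - len(cleaned)
print(f"Removed {n_removed} rows with a salary of zero")
```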