# Creating Baseline Models

In [1]:
# Author information
__author__ = "Troy Reynolds"
__email__ = "Troy.Lloyd.Reynolds@gmail.com"

In [2]:
# libraries
import pandas as pd
import sys
import inspect
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from statistics import mean, stdev
import os

# Extend the directory to get created functions
sys.path.insert(0, "./function_scripts")

# import helper functions
from data_import_functions import get_data
from Baseline_functions import avg_per_industry_degree, cross_val
from results import save_results

In [3]:
# load in the data
data = get_data("train", key = "jobId", target_variable = "salary", remove_zeros = True)

# drop companyId and the target; jobId is kept for pairing in the baseline regressor
features = data.drop(["companyId", "salary"], axis = 1)
target = data["salary"]

## Baseline Model Proposition
The baseline models reflect a naive estimation without higher-level modeling. The three baseline models that I propose are:
1. Predict the average per industry and degree
2. Predict the average per job type
3. Standard linear regression using categorical level averages
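
To make the first two baselines concrete, here is a minimal sketch of a group-average predictor. The function names, the toy values, and the global-mean fallback for unseen groups are assumptions made for illustration; they are not taken from the helper file used in this notebook.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are made up.
train = pd.DataFrame({
    "industry": ["OIL", "OIL", "WEB", "WEB"],
    "degree": ["MASTERS", "MASTERS", "NONE", "NONE"],
    "salary": [150, 130, 90, 110],
})

def fit_group_means(df, columns, target="salary"):
    """Return the per-group target means and the global mean (used as a fallback)."""
    return df.groupby(columns)[target].mean(), df[target].mean()

def predict_group_means(df, columns, group_means, global_mean):
    """Look up each row's group mean; groups never seen in training get the global mean."""
    lookup = group_means.rename("prediction").reset_index()
    merged = df[columns].merge(lookup, on=columns, how="left")
    return merged["prediction"].fillna(global_mean).to_numpy()

group_means, global_mean = fit_group_means(train, ["industry", "degree"])
new_rows = pd.DataFrame({"industry": ["OIL", "EDU"], "degree": ["MASTERS", "NONE"]})
predictions = predict_group_means(new_rows, ["industry", "degree"], group_means, global_mean)
```

The same two functions cover the second baseline by passing `["jobType"]` as `columns`.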

## Creation of Estimator Class
The class is created and stored in a helper .py file. The code is presented below.

In [4]:
# Predict the average per specified columns
print(inspect.getsource(avg_per_industry_degree))

class avg_per_industry_degree():
    """
    Creates a model based on averages of categorical variable levels
    
    Methods
    fit: Calculates the average of different levels or combination of levels for prediction
    predict: predicts the target variable based on the averages
    """
    def __init__(self, columns):
        self.fitted_columns = columns
        self.fitted = False
    
    def fit(self, X, y = None):
        """
        Parameters:
        X: dataframe
        y: dataframe, series, or numpy array
        columns: list of columns to have the averages based on
        """
        X_copy = X.copy()
        
        if (all(x in X.columns[X.dtypes == "O"] for x in self.fitted_columns)):
            X_copy["target"] = y.copy()
            self.level_averages = X_copy.groupby(self.fitted_columns) \
                                   .mean() \
                                   .drop(["yearsExperience", "milesFromMetropolis"], axis = 1)
            self.fitted = True
  

## Baseline Error Metric

#### Predict the average per industry and degree

In [5]:
# run cross validation
industry_degree_reg = avg_per_industry_degree(columns = ["industry", "degree"])
industry_degree_mse = cross_val(model = industry_degree_reg, X = features, y = target, cv = 5)

#### Predict the average per job type

In [6]:
jobType_reg = avg_per_industry_degree(columns = ["jobType"])
jobType_mse = cross_val(model = jobType_reg, X = features, y = target, cv = 5)
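
The source of `cross_val` lives in the helper file and is not shown here. As a rough sketch of what a per-fold MSE for a group-average predictor can look like, assuming a shuffled `KFold` split and a training-mean fallback for unseen levels (`group_mean_cv_mse`, its `seed` argument, and the toy data are hypothetical and are not the notebook's helper):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def group_mean_cv_mse(X, y, columns, cv=5, seed=0):
    """Per-fold MSE of a group-mean predictor (hypothetical stand-in for cross_val)."""
    fold_mse = []
    splitter = KFold(n_splits=cv, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        train = X.iloc[train_idx][columns].copy()
        train["target"] = y.iloc[train_idx].to_numpy()
        # Fit on training rows only: mean target per combination of levels.
        means = train.groupby(columns)["target"].mean().rename("pred").reset_index()
        # Predict held-out rows; unseen combinations fall back to the training mean.
        test = X.iloc[test_idx][columns].merge(means, on=columns, how="left")
        preds = test["pred"].fillna(train["target"].mean()).to_numpy()
        errors = y.iloc[test_idx].to_numpy() - preds
        fold_mse.append(float(np.mean(errors ** 2)))
    return fold_mse

# Toy data with made-up values, only to exercise the function.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"jobType": rng.choice(["CEO", "JUNIOR", "JANITOR"], size=100)})
base = X_toy["jobType"].map({"CEO": 140, "JUNIOR": 95, "JANITOR": 70})
y_toy = pd.Series(base + rng.normal(0, 5, size=100))
toy_mse = group_mean_cv_mse(X_toy, y_toy, columns=["jobType"], cv=5)
```

Returning one MSE per fold matches how the results cell later applies `mean` and `stdev` to each model's scores.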

#### Standard linear regression using categorical level averages

In [7]:
# drop ids and replace each categorical level with the average salary of that level
data_no_id = data.drop(["jobId", "companyId"], axis = 1)
cat_variables = data_no_id.columns[data_no_id.dtypes == "O"]
for col in cat_variables:
    data_no_id[col] = data_no_id.groupby(col)["salary"].transform("mean")
    
features_no_cat = data_no_id.drop("salary", axis = 1)

In [8]:
# regression
lin_reg = LinearRegression()
lin_neg_mse = cross_val_score(lin_reg,
                              features_no_cat, 
                              target, 
                              scoring = "neg_mean_squared_error", 
                              cv = 5,
                              verbose = 0,
                              n_jobs = -1)
lin_mse = -1*lin_neg_mse
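
One caveat about the encoding in cell 7: the level averages are computed from every row's salary, so the rows held out in each validation fold have already contributed to their own features. With many rows per level the effect is likely small, but it can make the cross-validated MSE slightly optimistic. A minimal sketch of computing the averages inside each fold instead, assuming a shuffled `KFold` split (the function `fold_wise_encoded_mse` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def fold_wise_encoded_mse(df, cat_cols, target="salary", cv=5, seed=0):
    """Per-fold MSE where level averages come from the training rows of each fold only."""
    feature_cols = [c for c in df.columns if c != target]
    scores = []
    splitter = KFold(n_splits=cv, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(df):
        train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()
        global_mean = train[target].mean()
        for col in cat_cols:
            level_means = train.groupby(col)[target].mean()
            train[col] = train[col].map(level_means)
            # Levels absent from the training rows fall back to the training mean.
            test[col] = test[col].map(level_means).fillna(global_mean)
        model = LinearRegression().fit(train[feature_cols], train[target])
        scores.append(mean_squared_error(test[target], model.predict(test[feature_cols])))
    return scores

# Toy data with made-up values, only to exercise the function.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "jobType": rng.choice(["CEO", "JUNIOR", "JANITOR"], size=n),
    "yearsExperience": rng.integers(0, 25, size=n),
})
base = toy["jobType"].map({"CEO": 140, "JUNIOR": 95, "JANITOR": 70})
toy["salary"] = base + 2 * toy["yearsExperience"] + rng.normal(0, 5, size=n)
fold_scores = fold_wise_encoded_mse(toy, cat_cols=["jobType"])
```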

#### Results
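The simple linear regression clearly outperformed the two average-based baselines, with an average MSE of 399.13 against 963.94 for the job type average and 1125.59 for the industry and degree average. This suggests that even a simple fitted model improves on a group-average guess by a large margin. The MSE benchmark to surpass is therefore 399.131258.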

In [9]:
# collect results
avg_mse = [mean(industry_degree_mse), mean(jobType_mse), mean(lin_mse)]
std_mse = [stdev(industry_degree_mse), stdev(jobType_mse), stdev(lin_mse)]
estimator = ["Industry and Degree Average", "Job Type Average", "Simple Linear Regression"]

results = pd.DataFrame(data = {"Estimator": estimator, 
                               "Average MSE": avg_mse, 
                               "Standard Deviation MSE": std_mse})
# sort once, then display and save the sorted results
results_sorted = results.sort_values("Average MSE", ascending = True)
display(results_sorted)
save_results(results_sorted, "baseline_model_results")

| | Estimator | Average MSE | Standard Deviation MSE |
|---|---|---|---|
| 2 | Simple Linear Regression | 399.131258 | 2.084684 |
| 1 | Job Type Average | 963.9446 | 3.304819 |
| 0 | Industry and Degree Average | 1125.587248 | 4.999431 |


Results Saved to .\results\baseline_model_results.pkl
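
To read these numbers in the units of the target, the averages can be converted to RMSE, and the gap between models expressed as a relative reduction. A small sketch using the three Average MSE values reported above (the variable names are chosen for this example):

```python
from math import sqrt

# Average MSE values copied from the results reported above.
average_mse = {
    "Simple Linear Regression": 399.131258,
    "Job Type Average": 963.9446,
    "Industry and Degree Average": 1125.587248,
}

# RMSE is expressed in the same units as salary, which makes the error easier to read.
rmse = {name: sqrt(mse) for name, mse in average_mse.items()}

# Fractional reduction in MSE relative to the better of the two average-based baselines.
reduction_vs_job_type = 1 - average_mse["Simple Linear Regression"] / average_mse["Job Type Average"]

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
print(f"MSE reduction vs. Job Type Average: {reduction_vs_job_type:.1%}")
```

By this arithmetic the linear regression's RMSE is just under 20 salary units, and its MSE is roughly 59% lower than that of the job type average.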


## Model Proposal

The proposed models fall into two categories: linear models and tree-based models.

<u>Linear Model:</u> The proposed linear models are built around Ridge regression because all features are correlated with the target variable and are therefore worth keeping. Lasso regression is considered together with polynomial features, where it may help with feature selection.
* Linear Regression
* Polynomial Linear Regression 
* Linear Regression with only industry and major interaction terms
* Ridge Regression
* Polynomial Ridge Regression 

<u>Tree-based Model:</u>
* Random-Forest Model
* XGBoost Regressor
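
As an illustration of the "Polynomial Ridge Regression" entry, here is a minimal sketch of how such a model could be assembled and scored with scikit-learn. The polynomial degree, `alpha`, the scaling step, and the synthetic data are assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with one linear effect and one interaction effect.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3 * X_toy[:, 0] + X_toy[:, 1] * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

# Expand to degree-2 terms, scale them, then fit an L2-penalized linear model.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# Same scoring convention as the baseline: negate the scores to get MSE per fold.
neg_mse = cross_val_score(poly_ridge, X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=5)
poly_ridge_mse = -neg_mse
```

Wrapping the steps in a pipeline means the polynomial expansion and scaling are refit on the training rows of each fold, so the per-fold MSE is comparable with the baseline table.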