# Salary estimator from listings

The `city_state.json` file was adapted from the GitHub repository [agalea91 - city_to_state_dictionary](https://github.com/agalea91/city_to_state_dictionary/blob/master/city_to_state.py).

The `state_abbr.json` file was adapted from the GitHub gist [JeffPaine - us_state_abbreviations.py](https://gist.github.com/JeffPaine/3083347).

The job posting dataset is available on Kaggle: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings).

## Setup

In [1]:
%%capture
% mamba install pandas xgboost scikit-learn plotly gensim #swifter
print('')

First, import the packages used to manage the dataset. The data itself is loaded in the cell after that.

In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from importlib import reload
from IPython.display import HTML, display
from data import DataManager
from wordmod import Job2Vec
from catword import Categorizer

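def load_scripts():
    # reload() only accepts module objects, not classes, so reload the
    # modules and return fresh references to the classes they define.
    import data, wordmod, catword
    for module in (data, wordmod, catword):
        reload(module)
    return (data.DataManager, wordmod.Job2Vec, catword.Categorizer)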

# (DataManager, Job2Vec, Categorizer) = load_scripts()

Extract the job posting data from the CSV and clean it.

In [3]:
dm = DataManager()
df = dm.load_data_files()
def shorten_long_cols(row):
    for name in ['job_desc','company_desc','skills_desc']:
        if isinstance(row[name], str):
            row[name] = row[name][:150] + '...' 
    return row

display(HTML(df.head(10).apply(shorten_long_cols, axis=1).to_html()))

df = dm.get_postings().copy()

print(df.info())

def shorten_long_cols(row):
    for name in ['description','skills_desc']:
        if isinstance(row[name], str):
            row[name] = row[name][:150] + '...' 
    return row

display(HTML(df.head(3).apply(shorten_long_cols, axis=1).to_html()))


Reading CSVs
Joining CSV tables
Dropping unhelpful columns.
Renaming confusing columns.
<class 'pandas.core.frame.DataFrame'>
Index: 10673697 entries, 921716 to 3906267224
Data columns (total 28 columns):
 #   Column               Dtype  
---  ------               -----  
 0   company_name         object 
 1   job_title            object 
 2   job_desc             object 
 3   max_salary           float64
 4   pay_period           object 
 5   location             object 
 6   med_salary           float64
 7   min_salary           float64
 8   work_type            object 
 9   experience_level     object 
 10  skills_desc          object 
 11  compensation_type    object 
 12  benefit_inferred     float64
 13  benefit_type         object 
 14  skill_name           object 
 15  job_industry         object 
 16  max_salary_0         float64
 17  med_salary_0         float64
 18  min_salary_0         float64
 19  pay_period_0         object 
 20  compensation_type_0  object 
 21  name    

job_id,company_name,job_title,job_desc,max_salary,pay_period,location,med_salary,min_salary,work_type,experience_level,skills_desc,compensation_type,benefit_inferred,benefit_type,skill_name,job_industry,max_salary_0,med_salary_0,min_salary_0,pay_period_0,compensation_type_0,name,company_desc,company_size,state,company_industry,speciality,employee_count
921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You...,20.0,HOURLY,"Princeton, NJ",,17.0,Full-time,,"Requirements: \n\nWe are seeking a College or Graduate Student (can also be completed with school) with a focus in Planning, Architecture, Real Estate D...",BASE_SALARY,,,Marketing,Real Estate,20.0,,17.0,HOURLY,BASE_SALARY,Corcoran Sawyer Smith,"With years of experience helping local buyers and sellers just like yourself, we know how to locate the finest properties and negotiate the best deals...",2.0,NJ,Real Estate,real estate,402.0
921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You...,20.0,HOURLY,"Princeton, NJ",,17.0,Full-time,,"Requirements: \n\nWe are seeking a College or Graduate Student (can also be completed with school) with a focus in Planning, Architecture, Real Estate D...",BASE_SALARY,,,Marketing,Real Estate,20.0,,17.0,HOURLY,BASE_SALARY,Corcoran Sawyer Smith,"With years of experience helping local buyers and sellers just like yourself, we know how to locate the finest properties and negotiate the best deals...",2.0,NJ,Real Estate,new development,402.0
921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You...,20.0,HOURLY,"Princeton, NJ",,17.0,Full-time,,"Requirements: \n\nWe are seeking a College or Graduate Student (can also be completed with school) with a focus in Planning, Architecture, Real Estate D...",BASE_SALARY,,,Sales,Real Estate,20.0,,17.0,HOURLY,BASE_SALARY,Corcoran Sawyer Smith,"With years of experience helping local buyers and sellers just like yourself, we know how to locate the finest properties and negotiate the best deals...",2.0,NJ,Real Estate,real estate,402.0
921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You...,20.0,HOURLY,"Princeton, NJ",,17.0,Full-time,,"Requirements: \n\nWe are seeking a College or Graduate Student (can also be completed with school) with a focus in Planning, Architecture, Real Estate D...",BASE_SALARY,,,Sales,Real Estate,20.0,,17.0,HOURLY,BASE_SALARY,Corcoran Sawyer Smith,"With years of experience helping local buyers and sellers just like yourself, we know how to locate the finest properties and negotiate the best deals...",2.0,NJ,Real Estate,new development,402.0
1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committed to serving clients with best practices to help them with change, improvements and better quality of l...",50.0,HOURLY,"Fort Collins, CO",,30.0,Full-time,,,BASE_SALARY,,,Health Care Provider,,50.0,,30.0,HOURLY,BASE_SALARY,,,,,,,
10998357,The National Exemplar,Assitant Restaurant Manager,"The National Exemplar is accepting applications for an Assistant Restaurant Manager.\nWe offer highly competitive wages, healthcare, paid time off, com...",65000.0,YEARLY,"Cincinnati, OH",,45000.0,Full-time,,We are currently accepting resumes for FOH - Asisstant Restaurant Management with a strong focus on delivering high quality customer service. Prefer 1...,BASE_SALARY,,,Management,Restaurants,65000.0,,45000.0,YEARLY,BASE_SALARY,The National Exemplar,"In April of 1983, The National Exemplar began operation in the Mariemont, Ohio landmark, the Mariemont Inn. The Inn was constructed in the mid-1920's ...",1.0,Ohio,Restaurants,,15.0
10998357,The National Exemplar,Assitant Restaurant Manager,"The National Exemplar is accepting applications for an Assistant Restaurant Manager.\nWe offer highly competitive wages, healthcare, paid time off, com...",65000.0,YEARLY,"Cincinnati, OH",,45000.0,Full-time,,We are currently accepting resumes for FOH - Asisstant Restaurant Management with a strong focus on delivering high quality customer service. Prefer 1...,BASE_SALARY,,,Manufacturing,Restaurants,65000.0,,45000.0,YEARLY,BASE_SALARY,The National Exemplar,"In April of 1983, The National Exemplar began operation in the Mariemont, Ohio landmark, the Mariemont Inn. The Inn was constructed in the mid-1920's ...",1.0,Ohio,Restaurants,,15.0
23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associate Attorney,"Senior Associate Attorney - Elder Law / Trusts and Estates Our legal team is committed to providing each client with quality counsel, innovative solu...",175000.0,YEARLY,"New Hyde Park, NY",,140000.0,Full-time,,"This position requires a baseline understanding of online marketing including Search Engine Marketing, Search Engine Optimization, and campaign analyt...",BASE_SALARY,1.0,401(k),Other,Law Practice,175000.0,,140000.0,YEARLY,BASE_SALARY,"Abrams Fensterman, LLP","Abrams Fensterman, LLP is a full-service law firm that provides exceptional legal advice, personalized representation, and cost-effective results to p...",2.0,New York,Law Practice,Civil Litigation,222.0
23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associate Attorney,"Senior Associate Attorney - Elder Law / Trusts and Estates Our legal team is committed to providing each client with quality counsel, innovative solu...",175000.0,YEARLY,"New Hyde Park, NY",,140000.0,Full-time,,"This position requires a baseline understanding of online marketing including Search Engine Marketing, Search Engine Optimization, and campaign analyt...",BASE_SALARY,1.0,401(k),Other,Law Practice,175000.0,,140000.0,YEARLY,BASE_SALARY,"Abrams Fensterman, LLP","Abrams Fensterman, LLP is a full-service law firm that provides exceptional legal advice, personalized representation, and cost-effective results to p...",2.0,New York,Law Practice,Corporate & Securities Law,222.0
23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associate Attorney,"Senior Associate Attorney - Elder Law / Trusts and Estates Our legal team is committed to providing each client with quality counsel, innovative solu...",175000.0,YEARLY,"New Hyde Park, NY",,140000.0,Full-time,,"This position requires a baseline understanding of online marketing including Search Engine Marketing, Search Engine Optimization, and campaign analyt...",BASE_SALARY,1.0,401(k),Other,Law Practice,175000.0,,140000.0,YEARLY,BASE_SALARY,"Abrams Fensterman, LLP","Abrams Fensterman, LLP is a full-service law firm that provides exceptional legal advice, personalized representation, and cost-effective results to p...",2.0,New York,Law Practice,Criminal Law,222.0
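The joined table above repeats each `job_id` once per combination of `skill_name` and `speciality`, which is why it has far more rows than there are postings. `DataManager` is not shown here, so the following is only a sketch of one way such rows could be collapsed back to one row per posting. The toy frame, the `unique_values` helper, and the choice to join values into a comma-separated string are assumptions made for illustration.

```python
import pandas as pd

# Toy frame mimicking the joined output: one posting repeated
# once per (skill_name, speciality) pair.
joined = pd.DataFrame({
    "job_id": [921716, 921716, 921716, 921716, 1829192],
    "job_title": ["Marketing Coordinator"] * 4
                 + ["Mental Health Therapist/Counselor"],
    "skill_name": ["Marketing", "Marketing", "Sales", "Sales",
                   "Health Care Provider"],
    "speciality": ["real estate", "new development",
                   "real estate", "new development", None],
})

def unique_values(series: pd.Series) -> str:
    """Hypothetical helper: distinct non-null values as one sorted string."""
    return ", ".join(sorted(series.dropna().unique()))

# One row per posting; repeated attributes are merged instead of duplicated.
collapsed = joined.groupby("job_id").agg(
    job_title=("job_title", "first"),
    skill_name=("skill_name", unique_values),
    speciality=("speciality", unique_values),
)
print(collapsed)
```

Collapsing before modelling keeps postings with many skills or specialities from being counted many times over.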


Retrieving an existing dataset at c:\dev\job-estimator/archive/clean_postings.bin
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   company_name                122130 non-null  object 
 1   title                       123849 non-null  object 
 2   description                 123842 non-null  object 
 3   max_salary                  29417 non-null   float64
 4   pay_period                  36073 non-null   object 
 5   location                    123849 non-null  object 
 6   med_salary                  6199 non-null    float64
 7   min_salary                  29369 non-null   float64
 8   formatted_work_type         123849 non-null  object 
 9   formatted_experience_level  94440 non-null   object 
 10  skills_desc                 2439 non-null    object 
 11  work_type                   123849 non-null  obj

,company_name,title,description,max_salary,pay_period,location,med_salary,min_salary,formatted_work_type,formatted_experience_level,skills_desc,work_type,state,avg_salary
0,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You...,38798.991928,HOURLY,"Princeton, NJ",,32979.143139,Full-time,,"Requirements: \n\nWe are seeking a College or Graduate Student (can also be completed with school) with a focus in Planning, Architecture, Real Estate D...",FULL_TIME,NJ,35889.067533
1,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committed to serving clients with best practices to help them with change, improvements and better quality of l...",96997.47982,HOURLY,"Fort Collins, CO",,58198.487892,Full-time,,,FULL_TIME,CO,77597.983856
2,The National Exemplar,Assitant Restaurant Manager,"The National Exemplar is accepting applications for an Assistant Restaurant Manager.\nWe offer highly competitive wages, healthcare, paid time off, com...",65000.0,YEARLY,"Cincinnati, OH",,45000.0,Full-time,,We are currently accepting resumes for FOH - Asisstant Restaurant Management with a strong focus on delivering high quality customer service. Prefer 1...,FULL_TIME,OH,55000.0
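In the cleaned output, `avg_salary` appears to be the midpoint of `min_salary` and `max_salary` (for example, 45000 and 65000 give 55000). How `DataManager` actually computes it is not shown, so this is only a sketch of how such a column could be derived. The fallback to `med_salary` and the toy `med_salary` value of 52000 are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy pay data; the first and third rows reuse ranges shown in the output above,
# the second row is invented to exercise the fallback.
pay = pd.DataFrame({
    "min_salary": [45000.0, None, 140000.0],
    "med_salary": [None, 52000.0, None],
    "max_salary": [65000.0, None, 175000.0],
})

# Midpoint of the posted range; skipna=False yields NaN if either bound is missing.
midpoint = pay[["min_salary", "max_salary"]].mean(axis=1, skipna=False)

# Assumed fallback: use med_salary when no full range was posted.
pay["avg_salary"] = midpoint.fillna(pay["med_salary"])
print(pay)
```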


### Create a statistical summary of the pay data.

In [4]:
pay_cols = ['max_salary','med_salary','min_salary']
pay_period_df = df[pay_cols]
display(HTML(pay_period_df.describe().style.format(precision=0,thousands=",").to_html()))    


| | max_salary | med_salary | min_salary |
|---|---:|---:|---:|
| count | 29417 | 6199 | 29369 |
| mean | 116322 | 56978 | 84732 |
| std | 79435 | 39234 | 50199 |
| min | 10335 | 10000 | 10000 |
| 25% | 65000 | 32979 | 49710 |
| 50% | 100000 | 43746 | 75000 |
| 75% | 150000 | 67898 | 110000 |
| max | 1500000 | 300500 | 1080000 |


### Display a bar graph of average salaries by state.

In [5]:
df = dm.get_postings_with_pay()[['state','avg_salary']].copy()

# Compute the mean salary and the sample size per state in one pass,
# drop states with no usable salary, and sort for plotting.
df = (
    df.groupby('state')['avg_salary']
      .agg(avg_salary='mean', count='count')
      .dropna()
      .sort_values(by='avg_salary')
)

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Bar(
        x = df.index.values, 
        y=df['avg_salary'],
        name="Average Salary",
    ), 
    secondary_y=False)

fig.add_trace(
    go.Scatter(
        x = df.index.values,
        y = df['count'],
        name="Sample Size"
    ),
    secondary_y=True
)

fig.update_xaxes(title_text="State",tickangle=90)

# Set y-axes titles
fig.update_yaxes(title_text="Dollars per year", secondary_y=False)
fig.update_yaxes(title_text="Job Listings (log)", secondary_y=True, type="log")

fig.show()

Dropping rows where every pay column is empty.


### Create a dataset and load the Job2Vec model

In [6]:
df = dm.get_postings().copy()
print('Loading j2v word vectors.')
job2vec = Job2Vec()
job2vec.get_dataset(df)
j2v = job2vec.get_model()

Loading j2v word vectors.
Retrieving an existing dataset at c:\dev\job-estimator/archive/tokenized_jobs.bin


ValueError: Job data must be passed in before the data can be prepared for training.

### Create a model to generate entity embeddings for XGBoost

In [None]:
df = dm.get_postings().copy()

x_cols=['state',
        'pay_period',
        'formatted_work_type',
        'formatted_experience_level',
        'title']
y_col = 'avg_salary'

mask = df[['title', 'state', y_col]].notna().all(axis=1) & df[y_col].gt(0)

df = df[x_cols+[y_col]].loc[mask].copy().reset_index()

print(df['title'].head().values)

['Marketing Coordinator' 'Mental Health Therapist/Counselor'
 'Assitant Restaurant Manager'
 'Senior Elder Law / Trusts and Estates Associate Attorney'
 ' Service Technician']


In [None]:
categorizer = Categorizer(j2v.wv, job2vec.tokenize)
categorizer.create_category_vectors()

Retrieving category vectors from c:\dev\job-estimator/assets/w2v/vectorized_categories.bin


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer

vector_length = job2vec.get_vector_length()

def title_to_vec(titles: pd.DataFrame) -> pd.DataFrame:
    """Map each job title to its category vector, one column per dimension."""
    vector_cols = [f'title{n}' for n in range(vector_length)]
    rows = [categorizer.categorize_to_vec(x) for x in titles['title'].values]
    return pd.DataFrame(rows, columns=vector_cols)

title_pipe = Pipeline(steps=[
    ("to_vec", FunctionTransformer(title_to_vec))
])

cat_pipe = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

non_title_cols = ['state', 'pay_period', 'formatted_work_type', 'formatted_experience_level']

col_transformer = ColumnTransformer(transformers=[
    ("cat", cat_pipe, non_title_cols),
    ("title", title_pipe, ['title'])
])

preprocessor = Pipeline(steps=[
    ('col_trfm', col_transformer)
    #('to_dmx', FunctionTransformer(xgb.DMatrix))
])

x, y = df[x_cols], df[y_col]

preprocessor = preprocessor.fit(x,y)

This did not match: Marketing Coordinator
This did not match: Mental Health Therapist/Counselor
This did not match: Assitant Restaurant Manager
This did not match: Senior Elder Law / Trusts and Estates Associate Attorney
This did not match:  Service Technician
This did not match: Economic Development and Planning Intern
This did not match: Building Engineer
This did not match: Administrative Coordinator
This did not match: Customer Service / Reservationist
This did not match: General Laborer
This did not match: Administrative Assistant
This did not match: Marketing & Office Coordinator
This did not match: Software Support Specialist
This did not match: Coordinator for Multicultural Student Organizations
This did not match: Chief Operating Officer
This did not match: Associate Attorney
This did not match: Manager, Retail Pharmacy
This did not match: SALES
This did not match: Sales Associate Natural Food Products
This did not match: National Sales Manager
This did not match: Montessori L

ValueError: Sentence is empty after tokenization so it will not be included. "Echocardiographer"
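The error above is raised when a title such as "Echocardiographer" has no tokens left after tokenization. One common way to keep preprocessing from failing on such titles is to fall back to a zero vector. The sketch below illustrates that idea with a plain dictionary standing in for `j2v.wv`. The names `toy_vectors` and `title_to_mean_vector`, the whitespace tokenizer, and the 3-dimensional vectors are all hypothetical and do not reflect how `Categorizer` or `Job2Vec` are implemented.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the trained word vectors.
toy_vectors = {
    "marketing": np.array([0.2, 0.1, 0.7]),
    "coordinator": np.array([0.4, 0.3, 0.1]),
}
VECTOR_LENGTH = 3

def title_to_mean_vector(title, vectors, vector_length):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = title.lower().split()  # simplistic tokenizer, for illustration only
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Out-of-vocabulary title: fall back to a neutral vector instead of raising.
        return np.zeros(vector_length)
    return np.mean(known, axis=0)

print(title_to_mean_vector("Marketing Coordinator", toy_vectors, VECTOR_LENGTH))
print(title_to_mean_vector("Echocardiographer", toy_vectors, VECTOR_LENGTH))
```

A zero vector lets the regressor fall back on the remaining features (state, pay period, work type, experience level) for titles it cannot embed.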

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score

xgb_reg: xgb.XGBRegressor = xgb.XGBRegressor(
    booster='gbtree',
    device='cuda',
    #random_state=1,
    eta=0.3, 
    max_depth=6,
    subsample=.8,
    colsample_bytree=.8,
    alpha=0,
    objective='reg:squarederror',
    eval_metric='mae',
    early_stopping_rounds=15,
    grow_policy='lossguide',
    verbosity=0,
    sampling_method='gradient_based',
    tree_method= 'hist',
    max_bin=1024,
    n_estimators=100,
    max_delta_step=1
    )

training_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ('reg', xgb_reg)
])

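x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.1)
# Pipeline.fit does not run eval_set through the preprocessing steps,
# so transform the held-out features by hand before passing them in.
x_test = preprocessor.transform(x_test)

# The reg__ prefix routes eval_set to the 'reg' step (the XGBRegressor),
# which uses it for early stopping.
xgb_pipe = training_pipe.fit(x_train, y_train, 
                             reg__eval_set=[(x_test, y_test)]
                             )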

#print(xgb_reg.best_score)

[0]	validation_0-mae:40253.81123
[1]	validation_0-mae:36592.50695
[2]	validation_0-mae:34207.72409
[3]	validation_0-mae:32974.32231
[4]	validation_0-mae:31947.13088
[5]	validation_0-mae:31333.66748
[6]	validation_0-mae:30945.27781
[7]	validation_0-mae:30686.13869
[8]	validation_0-mae:30525.03827
[9]	validation_0-mae:30397.34480
[10]	validation_0-mae:30299.49976
[11]	validation_0-mae:30241.40347
[12]	validation_0-mae:30188.46973
[13]	validation_0-mae:30175.12776
[14]	validation_0-mae:30148.30256
[15]	validation_0-mae:30121.31661
[16]	validation_0-mae:30114.77330
[17]	validation_0-mae:30099.78985
[18]	validation_0-mae:30083.34665
[19]	validation_0-mae:30081.50205
[20]	validation_0-mae:30075.20326
[21]	validation_0-mae:30077.04281
[22]	validation_0-mae:30065.72869
[23]	validation_0-mae:30040.52489
[24]	validation_0-mae:30037.88083
[25]	validation_0-mae:30024.19967
[26]	validation_0-mae:30017.68710
[27]	validation_0-mae:30009.88161
[28]	validation_0-mae:30014.81500
[29]	validation_0-mae:30

In [None]:
test = df[x_cols+[y_col]].sample(10)
res = xgb_pipe.predict(test[x_cols])
test['predicted']=res
display(HTML(test.style.format(precision=2,thousands=",").to_html())) 

| | state | pay_period | formatted_work_type | formatted_experience_level | title | avg_salary | predicted |
|---|---|---|---|---|---|---:|---:|
| 27655 | NY | HOURLY | Full-time | Mid-Senior level | Shift Supervisor- Bilingual Spanish Preferred | 34919.09 | 71434.69 |
| 1113 | CT | HOURLY | Part-time | | Group Fitness Instructor | 30534.22 | 24460.65 |
| 11429 | NC | YEARLY | Full-time | | Cyber Security IAM Architect | 154322.0 | 105058.88 |
| 11140 | TN | HOURLY | Full-time | Mid-Senior level | CT Technologist | 77113.0 | 64846.15 |
| 10537 | CA | HOURLY | Contract | Mid-Senior level | Event Coordinator | 96997.48 | 117814.92 |
| 4780 | WI | YEARLY | Full-time | Mid-Senior level | Electrical Project Manager/Estimator | 112500.0 | 111660.81 |
| 15969 | IN | YEARLY | Full-time | Executive | General Manager | 102500.0 | 152276.83 |
| 8570 | IL | YEARLY | Full-time | | Staff Application Developer | 204250.0 | 101096.45 |
| 7281 | IL | YEARLY | Full-time | Mid-Senior level | National Enterprise Business Development Director | 180000.0 | 129527.44 |
| 8395 | CA | YEARLY | Full-time | Mid-Senior level | STAFF PHYSICIAN | 280000.0 | 135153.48 |
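To put the sampled predictions on the same scale as the `validation_0-mae` values logged during training, the per-row absolute error and its mean can be computed directly. This is a small sketch that uses only the ten rows shown above, typed in by hand. The variable names are illustrative.

```python
# (avg_salary, predicted) pairs copied from the ten sampled rows above.
sampled = [
    (34919.09, 71434.69),
    (30534.22, 24460.65),
    (154322.0, 105058.88),
    (77113.0, 64846.15),
    (96997.48, 117814.92),
    (112500.0, 111660.81),
    (102500.0, 152276.83),
    (204250.0, 101096.45),
    (180000.0, 129527.44),
    (280000.0, 135153.48),
]

# Absolute error per row, then the mean absolute error over the sample.
abs_errors = [abs(actual - predicted) for actual, predicted in sampled]
mae = sum(abs_errors) / len(abs_errors)

print(f"MAE over {len(sampled)} sampled rows: {mae:,.2f}")
```

Because the rows were sampled from `df`, which also contains the training data, this figure is not a clean hold-out estimate. Ten rows is also too small a sample to compare meaningfully with the validation MAE.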


In [None]:
xgb_reg.save_model('c:/dev/job-estimator/assets/XGBReggressor.ubj')


In [None]:
df = df[['title','state','avg_salary']].dropna().copy()

fig = go.Figure(data=[go.Scatter3d(x=df['state'], y=df['title'], z=df['avg_salary'], mode='markers')])

# 3D axis titles live on the scene; update_xaxes/update_yaxes only affect 2D axes.
fig.update_layout(scene=dict(
    xaxis_title="State",
    yaxis_title="Position",
    zaxis_title="Average salary",
))

fig.show()

Retrieving category vectors from c:\dev\job-estimator/assets/w2v/vectorized_categories.bin
Retrieving an existing data at c:\dev\job-estimator/archive/categorized_job_titles.bin
