## <p align="center">Data Science Job Salaries</p>

In this notebook, we compare several simple regression models (Linear Regression, Decision Tree Regressor, K-Nearest Neighbors Regressor, and Random Forest Regressor) and select the one that best predicts the `salary_in_usd` variable, as measured by mean absolute error (MAE).

In [None]:
import numpy as np
import pandas as pd
import os

#### Data Preprocessing

In [2]:
df = pd.read_csv('ds_salaries.csv', index_col='Unnamed: 0')

In [3]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [4]:
unused_cols = ['work_year', 'salary']
df.drop(unused_cols, axis=1, inplace=True)

In [5]:
df['remote_ratio'] = df['remote_ratio'].map({0:'No Remote', 50:'Partially Remote', 100:'Full Remote'})

In [6]:
df.head()

Unnamed: 0,experience_level,employment_type,job_title,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,MI,FT,Data Scientist,EUR,79833,DE,No Remote,DE,L
1,SE,FT,Machine Learning Scientist,USD,260000,JP,No Remote,JP,S
2,SE,FT,Big Data Engineer,GBP,109024,GB,Partially Remote,GB,M
3,MI,FT,Product Data Analyst,USD,20000,HN,No Remote,HN,S
4,SE,FT,Machine Learning Engineer,USD,150000,US,Partially Remote,US,L


Next, we find the categorical columns with high cardinality (more than 10 unique values) and drop them from the data frame, since one-hot encoding them would add a large number of columns.

In [7]:
high_cardinality_cols = [col for col in df.columns if df[col].nunique()>10 and df[col].dtype == 'object']
high_cardinality_cols

['job_title', 'salary_currency', 'employee_residence', 'company_location']

In [8]:
# Drop the high-cardinality columns
df.drop(high_cardinality_cols, axis=1, inplace=True)
df.head()

Unnamed: 0,experience_level,employment_type,salary_in_usd,remote_ratio,company_size
0,MI,FT,79833,No Remote,L
1,SE,FT,260000,No Remote,S
2,SE,FT,109024,Partially Remote,M
3,MI,FT,20000,No Remote,S
4,SE,FT,150000,Partially Remote,L


In [9]:
from sklearn.model_selection import train_test_split

X = df.drop(['salary_in_usd'], axis=1)
y = df['salary_in_usd']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

With the features and target split into training and test sets, we now encode the categorical columns as numbers. We try two options:
1. Ordinal Encoder
2. One Hot Encoder
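
To make the difference between the two options concrete, here is a minimal sketch on a toy column. The values are illustrative and not taken from the dataset; only the column name `company_size` is borrowed from it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data for illustration only (not rows from ds_salaries.csv)
toy = pd.DataFrame({'company_size': ['L', 'S', 'M', 'S']})

# Ordinal encoding: each category becomes one integer code.
# By default the categories are sorted, so L -> 0, M -> 1, S -> 2.
ordinal_codes = OrdinalEncoder().fit_transform(toy)

# One-hot encoding: each category becomes its own indicator column.
one_hot = pd.get_dummies(toy)

print(ordinal_codes.ravel())
print(list(one_hot.columns))
```

Ordinal encoding keeps a single column but implies an order between categories, while one-hot encoding makes no such assumption at the cost of extra columns.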

### Ordinal Encoder

In [10]:
from sklearn.preprocessing import OrdinalEncoder

In [11]:
my_encoder = OrdinalEncoder()
ordinal_X_train = pd.DataFrame(my_encoder.fit_transform(X_train))
ordinal_X_test = pd.DataFrame(my_encoder.transform(X_test))

ordinal_X_train.columns = X_train.columns
ordinal_X_test.columns = X_test.columns
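
By default, `OrdinalEncoder` assigns codes in sorted order of the category labels, and `transform` raises an error on categories it did not see during `fit`. The sketch below shows one possible way to handle both points on toy data. The ranking `EN < MI < SE < EX` is an assumption about what the `experience_level` codes mean, and `'XX'` is a made-up unseen label.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data for illustration only
toy_train = pd.DataFrame({'experience_level': ['MI', 'SE', 'EN', 'SE']})
toy_test = pd.DataFrame({'experience_level': ['EX', 'MI', 'XX']})  # 'XX' is hypothetical

# Assumed seniority order (an assumption, not stated in the notebook)
levels = ['EN', 'MI', 'SE', 'EX']

encoder = OrdinalEncoder(
    categories=[levels],                 # explicit order instead of sorted labels
    handle_unknown='use_encoded_value',  # do not fail on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)

# Passing columns and index keeps the encoded frames aligned with the originals
train_codes = pd.DataFrame(encoder.fit_transform(toy_train),
                           columns=toy_train.columns, index=toy_train.index)
test_codes = pd.DataFrame(encoder.transform(toy_test),
                          columns=toy_test.columns, index=toy_test.index)

print(train_codes['experience_level'].tolist())
print(test_codes['experience_level'].tolist())
```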

### One Hot Encoder

In [12]:
OH_X_train = pd.get_dummies(X_train)
OH_X_test = pd.get_dummies(X_test)

In [13]:
len(OH_X_train.columns) == len(OH_X_test.columns)

True
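
An equal column count does not guarantee that the two frames have the same columns: if a category appears in only one of the splits, `pd.get_dummies` produces differently named columns. The sketch below, on toy data that is not from the dataset, shows how that can happen and one possible way to align the test frame to the training columns with `reindex`.

```python
import pandas as pd

# Toy data for illustration only: 'PT' occurs only in train, 'CT' only in test
toy_train = pd.DataFrame({'employment_type': ['FT', 'PT', 'FT']})
toy_test = pd.DataFrame({'employment_type': ['FT', 'CT']})

oh_train = pd.get_dummies(toy_train)
oh_test = pd.get_dummies(toy_test)

# Same number of columns, but not the same columns
same_count = len(oh_train.columns) == len(oh_test.columns)
same_names = list(oh_train.columns) == list(oh_test.columns)
print(same_count, same_names)

# Keep exactly the training columns; categories missing from test are filled with 0,
# and test-only categories are dropped
oh_test_aligned = oh_test.reindex(columns=oh_train.columns, fill_value=0)
print(list(oh_test_aligned.columns))
```

Comparing the column names directly is a stricter check than comparing lengths.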

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error

In [15]:
def print_mae(model, X_train, X_test, encoding_type: str, model_name: str):
    """Fit `model` on the encoded training data and print its test-set MAE.

    Note: `y_train` and `y_test` are read from the notebook's global scope.
    """
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    mae = mean_absolute_error(y_test, preds)

    print(f"{model_name} :: {encoding_type} :: {mae}")
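
For reference, MAE is the mean of the absolute differences between the true and predicted values, so it is expressed in the same unit as the target (here, US dollars). The short sketch below computes it by hand on made-up numbers and compares the result with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up salaries for illustration only
y_true = np.array([100_000, 80_000, 150_000])
y_pred = np.array([90_000, 85_000, 120_000])

# Absolute errors are 10000, 5000 and 30000; their mean is 15000
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```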

In [16]:
models = {'Linear Regression':LinearRegression(),
          'Decision Tree Regressor':DecisionTreeRegressor(random_state=1), 
          'KNN Regressor':KNeighborsRegressor(), 
          'Random Forest Regressor':RandomForestRegressor(random_state=1)}

#### Ordinal Encoder MAE

In [17]:
for (model_name, model) in models.items():
    print_mae(model=model, 
              X_train=ordinal_X_train, 
              X_test=ordinal_X_test, 
              encoding_type='Ordinal', 
              model_name=model_name)

Linear Regression :: Ordinal :: 44957.03891143329
Decision Tree Regressor :: Ordinal :: 44415.610846094925
KNN Regressor :: Ordinal :: 49258.571052631574
Random Forest Regressor :: Ordinal :: 42798.90597758126


#### One Hot Encoder MAE

In [18]:
for (model_name, model) in models.items():
    print_mae(model=model, 
              X_train=OH_X_train, 
              X_test=OH_X_test, 
              encoding_type='One Hot', 
              model_name=model_name)

Linear Regression :: One Hot :: 41667.48684210526
Decision Tree Regressor :: One Hot :: 44663.99132855106
KNN Regressor :: One Hot :: 43099.58552631579
Random Forest Regressor :: One Hot :: 43119.25430596365
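
To compare the eight runs side by side, the printed values can be collected into a small table. This is only a sketch: the numbers are copied from the outputs above and rounded to two decimals, and the variable names are illustrative.

```python
import pandas as pd

# MAE values copied from the printed outputs above (rounded to two decimals)
results = pd.DataFrame(
    {
        'Ordinal': [44957.04, 44415.61, 49258.57, 42798.91],
        'One Hot': [41667.49, 44663.99, 43099.59, 43119.25],
    },
    index=['Linear Regression', 'Decision Tree Regressor',
           'KNN Regressor', 'Random Forest Regressor'],
)

# Encoding whose best model has the lowest MAE, then the model achieving it
best_encoding = results.min().idxmin()
best_model = results[best_encoding].idxmin()

print(results)
print(f"Lowest MAE: {best_model} with {best_encoding} encoding")
```

On this single train/test split, Linear Regression with one-hot encoding has the lowest MAE. The gaps between most runs are small, so the ranking may change with a different split or random seed.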
