<div style="display:fill;
            border-radius:15px;
            background-color:skyblue;
            font-size:210%;
            font-family:sans-serif;
            letter-spacing:0.5px;
            padding:10px;
            color:white;
            border-style: solid;
            border-color: black;
            text-align:center;">
<b>
 👨‍🔬 In-Depth 10 Regressors to Predict Data Science Salary 💰</b></div>

This notebook presents **10 diverse regressors**, spanning parametric, non-parametric and ensemble learning methods, to predict data science salaries. It also demonstrates a **Nested Grid Search** that tunes the pre-processors together with the estimator parameters. In addition, several visualisation idioms, such as **Bar of Pie and Residual Plots**, are included. The best model pipeline can be retrieved at the end to make a final prediction. I hope you enjoy reading this. If you find this notebook useful, please **upvote and comment**. Thank you.

Author: Morris Lee <br>
Date: 4-9-2022

#### [1.0 Preprocessing Part](#1.0)
* [1.1 Import Packages and Define Useful Functions](#1.1)
* [1.2 Inspect Duplications](#1.2)
* [1.3 🎨 Inspect Value Counts](#1.3)
* [1.4 Get GDP Per Capita](#1.4)
* [1.5 Add New Columns](#1.5)
* [1.6 🎨 Bar of Pie Chart](#1.6)
* [1.7 Reduce Dimension of Job Title](#1.7)
* [1.8 Distinguish Categorical and Numerical](#1.8)
* [1.9 One Hot Encode](#1.9)
* [1.10 🎨 Plot Distributions](#1.10)
    
#### [2.0 Modelling Part](#2.0)
* [2.1 Import Modelling Packages](#2.1)
* [2.2 Define Nested Grid Search Functions](#2.2)
* [2.3 Define Residual Plotting Functions](#2.3)
* [2.4 ⭐ Modelling - 10 Models - Training + Evaluations](#2.4)
* [2.5 Concatenate Results](#2.5)
* [2.6 Get Overall Best Results](#2.6)
* [2.7 Make a Final Prediction](#2.7)

# <b>1.0 <span style='color:red'>|</span> Pre-Processing Part</b> <a class="anchor" id="1.0"></a>

# <b>1.1 <span style='color:red'>|</span> Import Packages and Define Useful Functions </b> <a class="anchor" id="1.1"></a>

Let's import the packages and define two helper functions that will be used throughout this notebook:
1. `vc` - pretty-prints the value counts of a column
2. `shape` - pretty-prints the dimensions of a dataframe


In [None]:
import pandas as pd
import numpy as np
import pycountry
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
from matplotlib.patches import Circle
import matplotlib
import seaborn as sns

def shape(df,df_name):
    print(f'STATUS: Dimension of "{df_name}" = {df.shape}')

def vc(df, column, r=False):
    vc_df = df.reset_index().groupby([column]).size().to_frame('count')
    vc_df['percentage (%)'] = vc_df['count'].div(sum(vc_df['count'])).mul(100)
    vc_df = vc_df.sort_values(by=['percentage (%)'], ascending=False)
    if r:
        return vc_df
    else:
        print(f'STATUS: Value counts of "{column}"...')
        display(vc_df)
        
df = pd.read_csv("./data/ds_salaries.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
display(df.head())
df.info()

# <b>1.2 <span style='color:red'>|</span> Inspect Duplications </b> <a class="anchor" id="1.2"></a>

There are 42 duplicated rows in the dataframe. Let's remove them.

In [None]:
num_duplicated = len(df[df.duplicated()])
print(f'STATUS: There are {num_duplicated} duplicated rows')
shape(df,'df')
df = df.drop_duplicates()
shape(df,'After removing duplicates')
df.head()
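
As a small illustration of what `duplicated()` and `drop_duplicates()` do, here is a minimal sketch on a toy dataframe. The values are made up for the example and are not taken from the salary dataset.

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the first one exactly
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Analyst", "Data Scientist"],
    "salary_in_usd": [100000, 60000, 100000],
})

# duplicated() flags every repeat of an earlier row (keep='first' by default)
num_duplicated = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and removes the repeats
deduped = toy.drop_duplicates()

print(num_duplicated, toy.shape, deduped.shape)
```

Only fully identical rows count as duplicates here, which matches how the cell above calls both methods without a `subset` argument.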

# <b>1.3 <span style='color:red'>|</span> Inspect Value Counts </b> <a class="anchor" id="1.3"></a>

Let's study the value counts of the dataframe. To make the labels easier to read, the abbreviations are replaced with more explicit, meaningful words. Besides showing the value counts, we can plot donut charts... Sounds delicious :p

In [None]:
df = df.replace({'EN': 'Entry-level', 'SE': 'Senior-level', 'EX':'Expert', 'MI':'Mid-level',
           'PT': 'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'})

plt.style.use('ggplot')

def pie(df, column):

    fig, axs = plt.subplots(nrows = 2, ncols = 2)
    fig = matplotlib.pyplot.gcf()
    fig.subplots_adjust(wspace=0.1)
    fig.set_size_inches(15, 13)
    fig.suptitle(f"Donut Charts", fontsize=20,fontweight='bold')
    
    counter = 0
    for i in range(2):
        for j in range(2):
            target = column[counter]
            # Pie chart, where the slices will be ordered and plotted counter-clockwise:
            labels = df[target].value_counts().index.tolist()
            sizes = np.rint(df[target].value_counts().values/ df[target].value_counts().values.sum() *100)
            explode = tuple(np.zeros(len(labels))+0.1)

            axs[i,j].pie(sizes, labels=labels, autopct='%1.1f%%', radius=2, explode = explode)
            axs[i,j].axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
            axs[i,j].set_title(f'{target}', fontsize=18, fontname="Arial",fontweight='bold')

            #draw circle
            centre_circle = Circle((0,0),1,fc='white')
            axs[i,j].add_patch(centre_circle)
            
            counter+=1

    return plt.show()

pie(df, ['work_year','experience_level','employment_type','company_size'])



categorical = ['work_year', 'experience_level','employment_type','job_title','salary_currency','employee_residence','company_location','company_size']
for col in categorical:
    vc(df,col)

# <b>1.4 <span style='color:red'>|</span> Get GDP Per Capita </b> <a class="anchor" id="1.4"></a>

Here is an interesting one: GDP per capita information (the 2018 column) was obtained from another Kaggle dataset. The objective is to merge it back into the main dataframe to make it more informative.

In [None]:
url = 'https://raw.githubusercontent.com/k-w-lee/kaggle-file/main/gdp_per_capita.csv'
gdp_per_capita = pd.read_csv(url)
gdp_per_capita = gdp_per_capita[['Country','2018']]
gdp_per_capita.columns = ['country','gdp_per_capita']
gdp_per_capita['gdp_per_capita']=pd.to_numeric(gdp_per_capita['gdp_per_capita'],errors='coerce')
gdp_per_capita
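
The `errors='coerce'` option used above is worth a quick look. The sketch below uses made-up raw strings (an assumption, since I am not showing the real file contents here) to illustrate how unparseable entries become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: GDP tables often mix numbers with placeholders
raw = pd.Series(["62805.3", "..", "n/a", "4521"])

# errors='coerce' turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())
print(int(clean.isna().sum()), "values could not be parsed")
```

Those `NaN` values matter later, because rows without a usable GDP figure are removed by `dropna()`.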

# <b>1.5 <span style='color:red'>|</span> Add New Columns </b> <a class="anchor" id="1.5"></a>

This cell concentrates a bunch of pre-processing steps in one place. To summarise, 6 new columns are added to further process the data, and several redundant columns are dropped at the end.
 

In [None]:
shape(df,'df before added columns')
# to get country name for employee residence
df['employee_residence_name'] = df.apply(lambda x: pycountry.countries.get(alpha_2 =x['employee_residence']).name \
                                         if (pycountry.countries.get(alpha_2 =x['employee_residence']) is not None) \
                                         else 'None' ,axis=1)
print("STATUS: Added column employee_residence_name")

# to get country name for company location
df['company_location_name'] = df.apply(lambda x: pycountry.countries.get(alpha_2 =x['company_location']).name \
                                         if (pycountry.countries.get(alpha_2 =x['company_location']) is not None) \
                                         else 'None' ,axis=1)
print("STATUS: Added column company_location_name")

# to get company's country gdp per capita
df2 = df.merge(gdp_per_capita, how='left', left_on='company_location_name', right_on = 'country')
df2.rename(columns = {'gdp_per_capita': 'gdp_per_capita_company'}, inplace=True)
df2.drop('country', inplace=True, axis=1)
print("STATUS: Added column company's country gdp per capita")

# to get residence's country gdp per capita
df2 = df2.merge(gdp_per_capita, how='left', left_on='employee_residence_name', right_on = 'country')
df2.drop('country', inplace=True, axis=1)
df2.rename(columns = {'gdp_per_capita': 'gdp_per_capita_residence'}, inplace=True)
print("STATUS: Added column residence's country gdp per capita")

# is the working country same as residence country?
df2['same_working_country'] = df2.apply(lambda x:  'Local Worker' if x['company_location_name'] == x['employee_residence_name'] else 'Expatriate', axis=1)
print("STATUS: Added column same_working_country")

# is the working country gdp per capita same as residence country?
df2['went_high_went_low_gdp_capita'] = df2.apply(lambda x:  'Went Higher GDP per Capita' if x['gdp_per_capita_company'] > x['gdp_per_capita_residence'] else ('Went Lower GDP per Capita' if x['gdp_per_capita_company'] < x['gdp_per_capita_residence'] else 'Same'), axis=1)
print("STATUS: Added column went_high_went_low_gdp_capita")

shape(df2,'Before drop NA')
# drop rows whose country code or GDP per capita could not be resolved
df2=df2.dropna()

# drop redundant columns
to_drop = ['employee_residence_name', 'company_location_name','employee_residence',\
           'company_location','salary_currency','salary']
df2 = df2.drop(to_drop, axis=1)
print(f"STATUS: Dropped {to_drop}")

shape(df2,'After drop NA')
df2
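
To make the merge-then-compare logic easier to follow, here is a minimal sketch on toy data. The country names, GDP numbers and the shortened column names `gdp_company` and `gdp_residence` are hypothetical. It shows why unmatched countries end up as `NaN` after a left merge and why those rows need to be dropped.

```python
import pandas as pd

# Hypothetical lookup table and rows; 'Atlantis' has no GDP entry on purpose
gdp = pd.DataFrame({"country": ["Germany", "India"],
                    "gdp_per_capita": [47000.0, 2000.0]})
people = pd.DataFrame({"company_location_name": ["Germany", "Atlantis"],
                       "employee_residence_name": ["India", "India"]})

# Left merge keeps every row of `people`; unmatched names get NaN
merged = people.merge(gdp, how="left",
                      left_on="company_location_name", right_on="country")
merged = merged.rename(columns={"gdp_per_capita": "gdp_company"}).drop(columns="country")
merged = merged.merge(gdp, how="left",
                      left_on="employee_residence_name", right_on="country")
merged = merged.rename(columns={"gdp_per_capita": "gdp_residence"}).drop(columns="country")

# Any comparison with NaN is False, so an unmatched row is neither higher nor lower
higher = merged["gdp_company"] > merged["gdp_residence"]
lower = merged["gdp_company"] < merged["gdp_residence"]
print(merged)
print(higher.tolist(), lower.tolist())

# dropna() removes the rows whose country could not be matched
print(merged.dropna().shape)
```

Without the `dropna()` step, such a row would fall through both comparisons and be labelled 'Same', which would be misleading.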

# <b>1.6 <span style='color:red'>|</span> Bar of Pie Chart </b> <a class="anchor" id="1.6"></a>


This section answers questions like: Are there any expatriates in the dataset? If yes, are they working for a company in a country with a higher or lower GDP per capita? How are they distributed? Let the Bar of Pie chart answer these questions. :)

In [None]:
# make figure and assign axis objects

def pie_bar(ratios_pie, labels_pie, ratios_bar, labels_bar_tuple, colors_list, color_bar,bar_title,pie_title):
    fig = plt.figure(figsize=(9, 10.0625))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122)
    fig.subplots_adjust(wspace=0)

    # pie chart parameters
    ratios = ratios_pie
    labels = labels_pie
    explode = [0.1, 0]
    # rotate so that first wedge is split by the x-axis
    angle = -180 * ratios[0]
    ax1.pie(ratios, autopct='%1.1f%%', startangle=angle,
            labels=labels, explode=explode, colors = colors_list)
    ax1.set_title(pie_title, fontsize=20, fontname="Arial",fontweight='bold')
    # bar chart parameters
    xpos = 0
    bottom = 0
    ratios = ratios_bar
    width = .2
    colors = color_bar

    for j in range(len(ratios)):
        height = ratios[j]
        ax2.bar(xpos, height, width, bottom=bottom, color=colors[j])
        ypos = bottom + ax2.patches[j].get_height() / 2
        bottom += height
        ax2.text(xpos, ypos, "%d%%" % (ax2.patches[j].get_height() * 100),
                 ha='center')

    ax2.set_title(bar_title, fontsize=20, fontname="Arial",fontweight='bold')
    ax2.legend(labels_bar_tuple, loc='upper right')
    ax2.axis('off')
    ax2.set_xlim(- 2.5 * width, 2.5 * width)

    # use ConnectionPatch to draw lines between the two plots
    # get the wedge data
    theta1, theta2 = ax1.patches[0].theta1, ax1.patches[0].theta2
    center, r = ax1.patches[0].center, ax1.patches[0].r
    bar_height = sum([item.get_height() for item in ax2.patches])

    # draw top connecting line
    x = r * np.cos(np.pi / 180 * theta2) + center[0]
    y = r * np.sin(np.pi / 180 * theta2) + center[1]
    con = ConnectionPatch(xyA=(- width / 2, bar_height), xyB=(x, y),
                          coordsA="data", coordsB="data", axesA=ax2, axesB=ax1)
    con.set_color([0, 0, 0])
    con.set_linewidth(4)
    ax2.add_artist(con)

    # draw bottom connecting line
    x = r * np.cos(np.pi / 180 * theta1) + center[0]
    y = r * np.sin(np.pi / 180 * theta1) + center[1]
    con = ConnectionPatch(xyA=(- width / 2, 0), xyB=(x, y), coordsA="data",
                          coordsB="data", axesA=ax2, axesB=ax1)
    con.set_color([0, 0, 0])
    ax2.add_artist(con)
    con.set_linewidth(4)

    return plt.show()


labels_pie = df2.same_working_country.value_counts(normalize=True).index.tolist()
ratios_pie = df2.same_working_country.value_counts(normalize=True).tolist()

# TO REVERSE THE PIE CHART TO CORRECT POSITION
ratios_pie.insert(0, ratios_pie.pop(1))
labels_pie.insert(0, labels_pie.pop(1))

ratios_bar = df2[df2.same_working_country == 'Expatriate'].went_high_went_low_gdp_capita.value_counts(normalize=True).tolist()
labels_bar_tuple = tuple(df2[df2.same_working_country =='Expatriate'].went_high_went_low_gdp_capita.value_counts(normalize=True).index.tolist())
color_bar = ['lightgreen', 'pink','skyblue']
color_pie = ['lime', 'lightcoral']
pie_bar(ratios_pie, labels_pie, ratios_bar, labels_bar_tuple, color_pie, color_bar, "GDP Per Capita in the Working Country",'Is the Employee Local Worker?')

# <b>1.7 <span style='color:red'>|</span> Reduce Dimension of Job Title </b> <a class="anchor" id="1.7"></a>


The 'job_title' column is further pre-processed to reduce the number of unique values. I found that many titles do not need to be distinct, hence they are merged together. The details are summarised in the cell below.

In [None]:
df2 = df2.replace({'ML Engineer': 'Machine Learning Engineer', 
                   'BI Data Analyst' : 'Big Data Engineer', 
                   'Data Analytics Engineer': 'Data Analyst', 
                   'Head of Machine Learning':'Machine Learning Manager', 
                   'Lead Machine Learning Engineer':'Machine Learning Manager',
                   'Staff Data Scientist':'Data Scientist',
                   'Big Data Architect':'Big Data Engineer',
                   'Data Analytics Lead':'Data Analytics Manager', 
                   'Lead Data Scientist':'Head of Data Science',
                   'Machine Learning Infrastructure Engineer':'Machine Learning Engineer',
                   'Data Specialist':'Data Scientist',
                   'Marketing Data Analyst':'Data Analyst',
                   'Finance Data Analyst':'Data Analyst',
                   'Financial Data Analyst':'Data Analyst',
                   'Product Data Analyst':'Data Analyst',
                   '3D Computer Vision Researcher':'Computer Vision Engineer',
                   'Computer Vision Software Engineer':'Computer Vision Engineer',
                   'NLP Engineer':'Data Scientist',
                   'Applied Machine Learning Scientist': 'Machine Learning Engineer', 
                   'ETL Developer':'Data Architect','Principal Data Analyst':'Lead Data Analyst'})

searchfor = ['Head', 'Lead', 'Manager','Director','Principal']
is_managerial = df2['job_title'].str.contains('|'.join(searchfor))
df2['is_managerial'] = np.where(is_managerial, True, False)

display(df2.head())
vc(df2,'job_title')
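
The managerial flag relies on a regular-expression match. Here is a minimal sketch with a few example titles, chosen only for illustration, showing how the keyword list becomes a single pattern.

```python
import pandas as pd

titles = pd.Series(["Head of Data Science", "Data Analyst",
                    "Machine Learning Manager", "Lead Data Analyst"])

# Joining keywords with '|' builds a regular expression meaning "any of these"
searchfor = ["Head", "Lead", "Manager", "Director", "Principal"]
pattern = "|".join(searchfor)

# str.contains already returns a boolean Series, one flag per title
is_managerial = titles.str.contains(pattern)
print(is_managerial.tolist())
```

Note that the match is case-sensitive and works on substrings, so a title such as "Leader" would also be flagged.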

# <b>1.8 <span style='color:red'>|</span> Distinguish Categorical and Numerical </b> <a class="anchor" id="1.8"></a>


To easily distinguish CATEGORICAL from NUMERICAL columns, we can use a function that counts the unique values in each column. Here, a column with more than 10 unique values is considered numerical. This also classifies 'job_title' as numerical, which is incorrect. I did this on purpose because later I would like to examine whether OrdinalEncoding or OneHotEncoding of this column gives a better result, so I temporarily park it here. The other categorical columns will be treated with OneHotEncoding.

In [None]:
df3 = df2.copy()

def get_num_cat_col(df, n):
    numerical_columns = []
    categorical_columns = []
    for col in df.columns:
        len_unique = len(df[col].unique())
        if len_unique <= n:
            categorical_columns.append(col)
        else:
            numerical_columns.append(col)
    return numerical_columns, categorical_columns

num, cat = get_num_cat_col(df2, 10)
print('NUMERICAL', num , '\nCATEGORICAL' ,cat)

# <b>1.9 <span style='color:red'>|</span> One Hot Encode </b> <a class="anchor" id="1.9"></a>


We can define a function that makes one-hot encoding easy: it automatically encodes, prefixes, drops and joins the columns. Sounds hassle-free!

In [None]:
def one_hot_encode(df, column):
    # Get the one-hot encoding of the column, prefixed with the column name
    one_hot = pd.get_dummies(df[column]).add_prefix(f'{column}_')
    # Drop column as it is now encoded
    df = df.drop(column,axis = 1)
    print(f"one hot encoded {column}")
    # Join the encoded df
    df = df.join(one_hot)
    return df

for col in cat:
    df3 = one_hot_encode(df3, col)
shape(df3,'After One Hot Encoded')
df3.head()
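
For readers who want to see the encode, drop and join steps in isolation, here is a minimal sketch on a toy dataframe with made-up values.

```python
import pandas as pd

toy = pd.DataFrame({"company_size": ["S", "M", "L", "M"],
                    "remote_ratio": [0, 50, 100, 100]})

# One indicator column per category, prefixed with the original column name
one_hot = pd.get_dummies(toy["company_size"]).add_prefix("company_size_")

# Replace the original column by its indicator columns
encoded = toy.drop("company_size", axis=1).join(one_hot)
print(encoded.columns.tolist())
```

Each row has exactly one active indicator, so the number of columns grows with the number of distinct categories. That is why a high-cardinality column like 'job_title' is handled separately.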

# <b>1.10 <span style='color:red'>|</span> Plot Distributions </b> <a class="anchor" id="1.10"></a>


Now, we want to plot the distribution of the numerical columns. But remember to first remove 'job_title', which was temporarily parked there.

After plotting, we notice that our target variable 'salary_in_usd' is skewed to the right. We can treat the skewness with a log transformation. The log transformation is reversible, so it is not a problem to use it to alter the target values: predictions can be transformed back with the exponential function.

In [None]:
def vis_dist(df, col):
    variable = df[col].values
    ax = sns.displot(variable)
    plt.title(f'Distribution of {col}', fontsize=20, fontname="Arial",fontweight='bold')
    plt.xlabel(f'{col}')
    return plt.show()

num.remove('job_title')
for col in num:
    vis_dist(df3, col)
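
To back up the claim that the log transformation is reversible, here is a minimal sketch with hypothetical salary figures. It applies `log10` and then undoes it with a power of 10.

```python
import numpy as np

# Hypothetical right-skewed salaries in USD
salary = np.array([20_000.0, 50_000.0, 80_000.0, 120_000.0, 600_000.0])

# Forward transform: compress the long right tail
log_salary = np.log10(salary)

# Inverse transform: 10 ** x undoes log10, so predictions made on the log
# scale can be mapped back to dollars
recovered = 10 ** log_salary

print(np.round(log_salary, 2))
print(np.allclose(recovered, salary))
```

On the log scale, the largest value is no longer an order of magnitude away from the rest, which is the kind of compression we want for a right-skewed target.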


# <b>2.0 <span style='color:red'>|</span> Modelling </b> <a class="anchor" id="2.0"></a>


# <b>2.1 <span style='color:red'>|</span> Import Modelling Packages </b> <a class="anchor" id="2.1"></a>


Let's import a bunch of packages for modelling

In [None]:
# enable_halving_search_cv must be imported before HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.preprocessing import StandardScaler, Normalizer, OrdinalEncoder, OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              HistGradientBoostingRegressor, AdaBoostRegressor)
import scipy as sp
import scipy.special  # makes sure sp.special.exp10 is available
from tqdm import tqdm


# <b>2.2 <span style='color:red'>|</span> Define Nested Grid Search Functions </b> <a class="anchor" id="2.2"></a>

Here is an illustration of how we can enumerate every unique combination of PRE-PROCESSORS by using a PARAMETER GRID. The idea is to FIND the BEST TREATMENT for your data with the objective of maximising a metric, such as R2. Then, within each loop, we also tune the regressor's hyperparameters by using Halving Grid Search CV.

In [None]:
tuned_model = None
target_transformers = [TransformedTargetRegressor(regressor=tuned_model, 
                           func=np.log10, 
                           inverse_func=sp.special.exp10),
                       TransformedTargetRegressor(regressor=tuned_model)]

categorical_transformers = [OneHotEncoder(handle_unknown='ignore'), 
                            OrdinalEncoder()]

scaling_transformers = [Normalizer(), 
                        StandardScaler()]

param_grid = {'target_transformers':target_transformers,
              'categorical_transformers':categorical_transformers,
              'scaling_transformers':scaling_transformers}
grid = ParameterGrid(param_grid)

for n, para in enumerate(grid, start=1):
    print(n)
    print(para)
    print()
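
To see how many pre-processing combinations the grid produces, here is a minimal sketch that uses plain strings as stand-ins for the transformer objects. The strings are placeholders for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# String stand-ins for the transformer objects, to keep the sketch light
toy_grid = ParameterGrid({
    "target_transformers": ["log10 target", "raw target"],
    "categorical_transformers": ["OneHotEncoder", "OrdinalEncoder"],
    "scaling_transformers": ["Normalizer", "StandardScaler"],
})

# ParameterGrid enumerates the Cartesian product: 2 * 2 * 2 combinations
combos = list(toy_grid)
print(len(combos))
print(combos[0])
```

Every regressor in section 2.4 is therefore evaluated under 8 different data treatments.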

In [None]:
target_transformers = [TransformedTargetRegressor(regressor=tuned_model, 
                           func=np.log10, 
                           inverse_func=sp.special.exp10),
                       TransformedTargetRegressor(regressor=tuned_model)]

categorical_transformers = [OneHotEncoder(handle_unknown='ignore'), 
                            OrdinalEncoder()]

scaling_transformers = [Normalizer(), 
                        StandardScaler()]

param_grid = {'target_transformers':target_transformers,
              'categorical_transformers':categorical_transformers,
              'scaling_transformers':scaling_transformers}
grid = ParameterGrid(param_grid)

def assembling_model(categorical_transform, scale_transform,  \
                     X,  y, reg, param_distributions, tune):
        
    preprocessor = make_column_transformer(
        (categorical_transform, categorical_columns),
        (scale_transform, numerical_columns),
    )
    
    x_transform = preprocessor.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(x_transform, y, random_state=42)
    if tune:
        search = HalvingGridSearchCV(reg, param_distributions, random_state=42)
        search.fit(X_train, y_train)
        best_model = search.best_estimator_
        best_param = search.best_params_
    else:
        search = reg
        search.fit(X_train, y_train)
        best_model = reg
        best_param = None
    return best_model, X_train, X_test, y_train, y_test, best_param, preprocessor

def tuning_whole_algorithm(X,  y, reg, param_distributions, grid, tune=True):
    result_list = []
    for para in tqdm(grid):
        tuned_model, X_train, X_test, y_train, y_test, parameters, preprocessor_pipe = assembling_model(para['categorical_transformers'], para['scaling_transformers'], \
                         X,  y, reg, param_distributions, tune)
        
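        # build a fresh target transformer for this combination; re-using the
        # shared grid object would overwrite the regressor stored in earlier pipelines
        template = para['target_transformers']
        model = TransformedTargetRegressor(regressor=tuned_model,
                                           func=template.func,
                                           inverse_func=template.inverse_func)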
        
        # storing pipeline information
        pipeline_cache = make_pipeline(preprocessor_pipe, model)
        
        model.fit(X_train, y_train)
        prediction_test = model.predict(X_test)
        # compute the 5 test-set metrics once
        r2 = r2_score(y_test, prediction_test)
        mse = mean_squared_error(y_test, prediction_test)
        scores = {
            'R_Squared': r2,
            'Adj_R_Squared': 1 - (1 - r2) * (len(y) - 1) / (len(y) - X.shape[1] - 1),
            'MSE': mse,
            # square root of MSE; avoids the `squared` argument that newer scikit-learn releases deprecate
            'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, prediction_test),
        }

        # one result row per metric, all sharing the same model information
        n_metrics = len(scores)
        model_text_list = [type(model.regressor).__name__] * n_metrics
        metric_list = list(scores.keys())
        score_list = list(scores.values())
        param_list = [parameters] * n_metrics
        preprocessors_list = [para] * n_metrics
        pipelines_list = [pipeline_cache] * n_metrics

        d = {'model':model_text_list,'preprocessors':preprocessors_list ,'parameters': param_list ,'metric': metric_list, 'test predict score': score_list, 'Pipelines': pipelines_list}
        df = pd.DataFrame(data=d)
        result_list.append(df)
    df2 = pd.concat(result_list).reset_index(drop=True)
    return df2
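
The adjusted R2 stored by the function above follows the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), where the function takes n = `len(y)` and p = `X.shape[1]`. Here is a minimal standalone sketch of that formula. The helper name `adjusted_r2` and the numbers are hypothetical and only serve as a worked example.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: R^2 of 0.50 with 101 rows and 10 features
print(round(adjusted_r2(0.50, 101, 10), 4))
```

The penalty grows with the number of features, so a model needs a real gain in R2 to justify extra columns.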

# <b>2.3 <span style='color:red'>|</span> Define Residual Plotting Functions </b> <a class="anchor" id="2.3"></a>


In [None]:
def residual(model,model_name,X_train,X_test,y_train,y_test):
    model.fit(X_train, y_train)
    prediction_train = model.predict(X_train)
    prediction_test = model.predict(X_test)
    
    # PERFORMANCE METRICS (RMSE as the square root of MSE; avoids the
    # `squared` argument that newer scikit-learn releases deprecate)
    mse_test = mean_squared_error(y_test, prediction_test)
    rmse_test = np.sqrt(mse_test)
    mse_train = mean_squared_error(y_train, prediction_train)
    rmse_train = np.sqrt(mse_train)
    
    # RESIDUALS (ACTUAL MINUS PREDICTED)
    residual_train = y_train - prediction_train
    residual_test = y_test - prediction_test
    
    fig, axs = plt.subplots(nrows = 1, ncols = 2)
    fig = matplotlib.pyplot.gcf()
    fig.subplots_adjust(wspace=0.1)
    fig.set_size_inches(12, 5)
    fig.suptitle(f"Residual of Calibrated {model_name}", fontsize=14,fontweight='bold')

    axs[0].sharex(axs[1])
    axs[0].sharey(axs[1])

    axs[0].scatter(x = prediction_train,y = residual_train, alpha=0.1,color='red',label='Train Set')
    axs[0].set_title(f'Training Set',fontweight='bold')
    axs[0].set_xlabel('Predicted Values')
    axs[0].set_ylabel('Residual')
    yabs_max = abs(max(axs[0].get_ylim(), key=abs))
    axs[0].axhline(y=0, color='black', linestyle='--', label='Zero Residual')
    axs[0].legend()
    
    axs[1].scatter(x = prediction_test,y = residual_test, alpha=0.1,color='blue',label='Test Set')
    axs[1].set_title(f'Testing Set',fontweight='bold')
    axs[1].set_xlabel('Predicted Values')
    yabs_max = abs(max(axs[1].get_ylim(), key=abs))
    axs[1].axhline(y=0, color='black', linestyle='--', label='Zero Residual')
    axs[1].legend()

    props = dict(boxstyle='square', facecolor='whitesmoke', alpha=1, pad=0.5)
    axs[0].text(0.6, 0.75, f"MSE = {mse_train:.2f} \nRMSE = {rmse_train:.2f}", transform=axs[0].transAxes, fontsize=10,
        verticalalignment='top', bbox=props)
    axs[1].text(0.6, 0.75, f"MSE = {mse_test:.2f} \nRMSE = {rmse_test:.2f}", transform=axs[1].transAxes, fontsize=10,
        verticalalignment='top', bbox=props)
    return plt.show()


def get_best_model(df):
    df_t = df[df.metric== 'Adj_R_Squared']
    bestmodel = df_t.loc[df_t['test predict score'].idxmax()].Pipelines
    model_name = df_t.loc[df_t['test predict score'].idxmax()].model
    return bestmodel, model_name

def get_best_result(df_result):
    df_result_t = df_result[df_result.metric== 'R_Squared']
    r2_df = df_result_t.loc[df_result_t['test predict score'].idxmax()].to_frame().T

    df_result_t = df_result[df_result.metric== 'Adj_R_Squared']
    adjr2_df = df_result_t.loc[df_result_t['test predict score'].idxmax()].to_frame().T
    
    df_result_t = df_result[df_result.metric== 'MSE']
    mse_df = df_result_t.loc[df_result_t['test predict score'].idxmin()].to_frame().T

    df_result_t = df_result[df_result.metric== 'RMSE']
    rmse_df = df_result_t.loc[df_result_t['test predict score'].idxmin()].to_frame().T
    
    df_result_t = df_result[df_result.metric== 'MAE']
    mae_df = df_result_t.loc[df_result_t['test predict score'].idxmin()].to_frame().T
    return pd.concat([r2_df,adjr2_df,mse_df,rmse_df,mae_df])
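
The selection logic above boils down to "take the maximum for R2-type metrics and the minimum for error metrics". Here is a minimal sketch on a made-up results table. The helper `best_row` and all scores are hypothetical.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, metric)
results = pd.DataFrame({
    "model": ["Ridge", "Ridge", "SVR", "SVR"],
    "metric": ["R_Squared", "RMSE", "R_Squared", "RMSE"],
    "test predict score": [0.41, 47000.0, 0.48, 49000.0],
})

def best_row(df, metric, higher_is_better):
    """Return the row with the best score for one metric."""
    subset = df[df["metric"] == metric]
    scores = subset["test predict score"]
    best_index = scores.idxmax() if higher_is_better else scores.idxmin()
    return subset.loc[best_index]

best_r2_model = best_row(results, "R_Squared", higher_is_better=True)["model"]
best_rmse_model = best_row(results, "RMSE", higher_is_better=False)["model"]
print(best_r2_model, best_rmse_model)
```

As the toy numbers show, different metrics can crown different winners, which is why the notebook reports the best row for each of the 5 metrics.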

In [None]:
X = df3.drop('salary_in_usd', axis=1)
y = df3.salary_in_usd.values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

categorical_columns = ['job_title']
numerical_columns = X.columns.tolist()
numerical_columns.remove('job_title')
numerical_columns

# <b>2.4 <span style='color:red'>|</span> Modelling - Training + Evaluations </b> <a class="anchor" id="2.4"></a>


### <b>2.4.1 <span style='color:red'>|</span> LinearRegression </b> <a class="anchor" id="2.4.1"></a>


In [None]:
reg = LinearRegression()
param_distributions = None
pd.set_option('display.max_colwidth', 200)
df_result_lr = tuning_whole_algorithm(X,  y, reg, param_distributions, grid, tune=False)

# visualise residual
bestmodel, model_name = get_best_model(df_result_lr)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_lr)

### <b>2.4.2 <span style='color:red'>|</span> Ridge </b> <a class="anchor" id="2.4.2"></a>


In [None]:
reg = Ridge(alpha=.5)

param_distributions = None

df_result_ridge = tuning_whole_algorithm(X,  y, reg, param_distributions, grid, tune=False)

# visualise residual
bestmodel, model_name = get_best_model(df_result_ridge)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_ridge)

### <b>2.4.3 <span style='color:red'>|</span> Lasso </b> <a class="anchor" id="2.4.3"></a>


In [None]:
reg = Lasso(alpha=0.1, tol = 0.2)

param_distributions = None

df_result_lasso = tuning_whole_algorithm(X,  y, reg, param_distributions, grid, tune=False)

# visualise residual
bestmodel, model_name = get_best_model(df_result_lasso)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_lasso)

### <b>2.4.4 <span style='color:red'>|</span> KNeighborsRegressor </b> <a class="anchor" id="2.4.4"></a>


In [None]:
reg = KNeighborsRegressor()

param_distributions = {'n_neighbors': [5, 7, 9, 13], 'weights': ['uniform', 'distance']}
df_result_knn = tuning_whole_algorithm(X,  y, reg, param_distributions, grid, tune=True)

# visualise residual
bestmodel, model_name = get_best_model(df_result_knn)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_knn)

### <b>2.4.5 <span style='color:red'>|</span> SVR </b> <a class="anchor" id="2.4.5"></a>


In [None]:
reg = SVR()
param_distributions = {'kernel': ['rbf','poly'],'C':[50,100,200,300,400]}

df_result_svr = tuning_whole_algorithm(X,  y, reg, param_distributions, grid)

# visualise residual
bestmodel, model_name = get_best_model(df_result_svr)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_svr)

### <b>2.4.6 <span style='color:red'>|</span> DecisionTreeRegressor </b> <a class="anchor" id="2.4.6"></a>


In [None]:
reg = DecisionTreeRegressor(random_state=42)

param_distributions = {'max_depth': [3, 5, None], 'min_samples_split': [2, 3, 5]}
df_result_dtr = tuning_whole_algorithm(X,  y, reg, param_distributions, grid)

# visualise residual
bestmodel, model_name = get_best_model(df_result_dtr)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_dtr)

### <b>2.4.7 <span style='color:red'>|</span> RandomForestRegressor </b> <a class="anchor" id="2.4.7"></a>

In [None]:
reg = RandomForestRegressor(random_state=42)
param_distributions = {'max_depth': [3, 5, None], 'n_estimators': [100, 300]}
df_result_rfr = tuning_whole_algorithm(X,  y, reg, param_distributions, grid)

# visualise residual
bestmodel, model_name = get_best_model(df_result_rfr)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_rfr)

### <b>2.4.8 <span style='color:red'>|</span> GradientBoostingRegressor </b> <a class="anchor" id="2.4.8"></a>


In [None]:
reg = GradientBoostingRegressor(random_state=42)
param_distributions = {'learning_rate': [0.1, 0.2, 0.5]}
df_result_gbr = tuning_whole_algorithm(X,  y, reg, param_distributions, grid)

# visualise residual
bestmodel, model_name = get_best_model(df_result_gbr)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_gbr)

### <b>2.4.9 <span style='color:red'>|</span> HistGradientBoostingRegressor </b> <a class="anchor" id="2.4.9"></a>


In [None]:
param_distributions = {'learning_rate': [0.1, 0.2, 0.5]}

reg = HistGradientBoostingRegressor(random_state=42)
df_result_hgb = tuning_whole_algorithm(X,  y, reg, param_distributions, grid)

# visualise residual
bestmodel, model_name = get_best_model(df_result_hgb)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_hgb)

### <b>2.4.10 <span style='color:red'>|</span> AdaBoostRegressor </b> <a class="anchor" id="2.4.10"></a>


In [None]:
reg = AdaBoostRegressor(random_state=42)
df_result_ada = tuning_whole_algorithm(X,  y, reg, None, grid, tune=False)
# visualise residual
bestmodel, model_name = get_best_model(df_result_ada)
residual(bestmodel, model_name,X_train,X_test,y_train,y_test)

# show result
get_best_result(df_result_ada)

# <b>2.5 <span style='color:red'>|</span> Concatenate Results </b> <a class="anchor" id="2.5"></a>


In [None]:
df_result = pd.concat([df_result_lr, df_result_ridge, df_result_lasso, 
                       df_result_knn, df_result_svr, df_result_dtr,
                       df_result_rfr, df_result_hgb, df_result_gbr,
                       df_result_ada]).reset_index(drop=True)
df_result

# <b>2.6 <span style='color:red'>|</span> Get Overall Best Results </b> <a class="anchor" id="2.6"></a>


In [None]:
get_best_result(df_result)

# <b>2.7 <span style='color:red'>|</span> Make a Final Prediction </b> <a class="anchor" id="2.7"></a>


In [None]:
df_result_t = df_result[df_result.metric== 'Adj_R_Squared']
BEST_model = df_result_t.loc[df_result_t['test predict score'].idxmax()].Pipelines.fit(X_train, y_train)

print('PREDICTED VALUES')
best_predict = BEST_model.predict(X_test)
pd.DataFrame({'Actual Y-Test Salary':y_test, 'Best Predicted Salary':best_predict})

In conclusion, this notebook has shown 10 regressors, spanning parametric, non-parametric and ensemble methods, to predict data science salaries. It has also shown how to tune the pre-processors together with the estimator parameters. The best model can be retrieved at the end to make a final prediction. I hope you enjoyed reading this. If you like this notebook, please upvote and comment. Thank you.

Author: Morris Lee <br>
Date: 4-9-2022