<h1 style="color:darkmagenta; text-align:center; font-family:Cursive">
    <b>Salary Prediction
    <br>for
    <br>Data Science Jobs</b>
</h1>

## <div><p style="color:#b50264; font-family:Cursive"><b>🎯 Notebook Goal</p></div>

1. **Predict salaries** in the data science job market. 💸

## <div><p style="color:#b50264; font-family:Cursive"><b>🏷️ Table of Contents</p></div>
<a id="top"></a>

   1. [Import Necessary Libraries](#1)
   2. [Getting Data](#2)
   3. [Pre-processing](#3)
      - 3.1 [Outliers](#3.1)
         - 3.1.1 [Finding Outliers](#3.1.1)
         - 3.1.2 [Removing Outliers](#3.1.2)
      - 3.2 [Remove Duplicates](#3.2)
      - 3.3 [Data Scaling](#3.3)
   4. [Modeling](#4)

<a id="1"></a>
## <div><p style="color:darkmagenta; font-family:Cursive"><b>1. Import Necessary Libraries</p></div>

In [1]:
# data
import pandas as pd
import numpy as np

# sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

# models
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

<a id="2"></a>
## <div><p style="color:darkmagenta; font-family:Cursive"><b>2. Getting Data</p></div>

In [2]:
# read data and create DataFrame
df = pd.read_csv("../data_given/ds_salaries.csv", index_col=0)
df.head(3)

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |


In [3]:
# shape of the dataset
print("Shape of the dataset: ", df.shape)

Shape of the dataset:  (607, 11)


<a id="3"></a>
## <div><p style="color:darkmagenta; font-family:Cursive"><b>3. Pre-processing</p></div>

<a id="3.1"></a>
### <div><p style="color:#b50264; font-family:Cursive"><b>3.1. Outliers</p></div>

<a id="3.1.1"></a>
#### <div><p style="color:MediumVioletRed; font-family:Cursive"><b>3.1.1. Finding Outliers</p></div>

In [4]:
# Print the value of one quartile for a given column
def show_quartile(quartile, percentile, col_name, quarter_value):
    print("Value of {}:".format(quartile))
    print(f"{quartile}: {percentile} percentile of the {col_name} values is: ", quarter_value)

In [5]:
# Calculate interquartile range
def interquartile_range(quartile):
    q1 = quartile[0]
    q3 = quartile[2]
    return q3 - q1

In [6]:
# Limit finder
def limit_finder(quartile, interquartile_range):
    q1 = quartile[0]
    q3 = quartile[2]

    low = q1 - 1.5 * interquartile_range
    up = q3 + 1.5 * interquartile_range

    return low, up

In [7]:
# Finding outliers
def find_outliers(data, low, up):
    outliers = []
    for value in data:
        if (value < low) or (value > up):
            outliers.append(value)

    return outliers

In [8]:
# Outliers detection method
def detect_outlier(data):
    quartile_list = []
    sorted_data = data.sort_values()

    quarters = {"Q1": 25, "Q2": 50, "Q3": 75}
    for q, p in quarters.items():
        quarter_value = np.percentile(sorted_data, p, method = "midpoint")
        quartile_list.append(quarter_value)
        show_quartile(q, p, data.name, quarter_value)

    iqr = interquartile_range(quartile_list)
    print("\nInterquartile range is: ", iqr)

    low_limit, up_limit = limit_finder(quartile_list, iqr)
    print("\nLow Limit is: ", low_limit)
    print("Up Limit is: ", up_limit)

    outliers = find_outliers(sorted_data, low_limit, up_limit)
    print("\nOutliers in the dataset is: ", outliers)
    return (low_limit, up_limit)

In [9]:
low_limit, high_limit = detect_outlier(df["salary_in_usd"])

Value of Q1:
Q1: 25 percentile of the salary_in_usd values is:  62726.0
Value of Q2:
Q2: 50 percentile of the salary_in_usd values is:  101570.0
Value of Q3:
Q3: 75 percentile of the salary_in_usd values is:  150000.0

Interquartile range is:  87274.0

Low Limit is:  -68185.0
Up Limit is:  280911.0

Outliers in the dataset is:  [324000, 325000, 380000, 405000, 412000, 416000, 423000, 450000, 450000, 600000]
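
The helper functions above apply the usual 1.5 × IQR rule step by step. As a minimal sketch of the same rule in vectorized form, the snippet below uses `Series.quantile` with midpoint interpolation on a small made-up series. The function name `iqr_bounds`, the parameter `k`, and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (low, up) limits for the k * IQR outlier rule."""
    # Midpoint interpolation mirrors method="midpoint" used with np.percentile above.
    q1 = values.quantile(0.25, interpolation="midpoint")
    q3 = values.quantile(0.75, interpolation="midpoint")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (made up): seven ordinary values and one extreme value.
salaries = pd.Series([10, 12, 14, 16, 18, 20, 22, 100], name="toy_salary")
low, up = iqr_bounds(salaries)

# Boolean mask selects everything outside the limits.
outliers = salaries[(salaries < low) | (salaries > up)].tolist()
print(low, up, outliers)
```

Applied to `df["salary_in_usd"]`, such a function should give the same limits as `detect_outlier`, since both rely on midpoint interpolation.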


<a id="3.1.2"></a>
#### <div><p style="color:MediumVioletRed; font-family:Cursive"><b>3.1.2. Removing Outliers</p></div>

In [10]:
df = df[(df["salary_in_usd"] < high_limit) & (df["salary_in_usd"] > low_limit)]
print("Minimum Salary of DS Job: ", df["salary_in_usd"].min())
print("Maximum Salary of DS Job: ", df["salary_in_usd"].max())
print("Update: New shape of the dataset: ", df.shape)

Minimum Salary of DS Job:  2859
Maximum Salary of DS Job:  276000
Update: New shape of the dataset:  (597, 11)


<div style="color:DarkSlateGray;text-align:center;border-bottom:2px dotted gray; border-left: 2px dotted gray;border-right: 2px dotted gray">
    <h4>Note</h4>
    <p style="text-align:center;">
        After outlier removal, the <span style='font-weight: bold'>minimum salary</span> is <span style='font-weight: bold'>2,859 USD</span>.
        <br>
        The <span style='font-weight: bold'>maximum salary</span> is <span style='font-weight: bold'>276,000 USD</span>.
        <br>
        The <span style='font-weight: bold'>reduced dataset</span> has shape <span style='font-weight: bold'>(597, 11)</span>.
    </p>
</div>

<a id="3.2"></a>
### <div><p style="color:#b50264; font-family:Cursive"><b>3.2. Remove Duplicates</p></div>

In [11]:
# Number of duplicate rows (rows identical to an earlier row)
duplicate_count = df.duplicated().sum()
print("There are {} duplicate rows in the dataset.".format(duplicate_count))

There are 42 duplicate rows in the dataset.


In [12]:
# Remove all the duplicates
df = df.drop_duplicates()
print("Update: Final Shape of the dataset: ", df.shape)

Update: Final Shape of the dataset:  (555, 11)


<div style="color:DarkSlateGray;text-align:center;border-bottom:2px dotted gray; border-left: 2px dotted gray;border-right: 2px dotted gray">
    <h4>Note</h4>
    <p style="text-align:center;">
        Shape of <span style='font-weight: bold'>given dataset</span> - <span style='font-weight: bold'>(607, 11)</span>.
        <br>
        Shape of <span style='font-weight: bold'>final dataset</span> - <span style='font-weight: bold'>(555, 11)</span>.
    </p>
</div>
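
To make the duplicate handling concrete, here is a small sketch on a made-up frame. By default, `duplicated()` flags every row that repeats an earlier row across all columns, and `drop_duplicates()` keeps the first occurrence. The column values below are invented for illustration.

```python
import pandas as pd

# Toy frame (made up): row 2 repeats row 0 exactly; row 3 differs in salary.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Scientist", "Data Scientist"],
    "salary_in_usd": [100000, 120000, 100000, 90000],
})

dup_mask = toy.duplicated()        # True only for the second copy of a row
n_duplicates = int(dup_mask.sum())
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_duplicates, deduped.shape)
```

Note that in a salary dataset, identical rows may be distinct people with the same job and pay, so whether to drop them is a modeling choice.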


<a id="3.3"></a>
### <div><p style="color:#b50264; font-family:Cursive"><b>3.3. Data Scaling</p></div>

In [13]:
# define global random state
np.random.seed(42)

In [14]:
# shuffle
df = df.sample(frac = 1, random_state = 42)

In [15]:
# Convert categorical variable into indicator variables
data = pd.get_dummies(df)
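
`pd.get_dummies(df)` one-hot encodes the text columns and passes numeric columns through unchanged, so `salary` and the `salary_currency` indicators end up among the features. Since `salary_in_usd` is essentially `salary` converted to USD, a model can largely recover the target from them. The sketch below shows one possible alternative, assuming these columns should be excluded; it uses the first three rows shown earlier with a subset of the columns, and the variable names are hypothetical.

```python
import pandas as pd

# First three rows of the dataset, restricted to a few columns for brevity.
toy = pd.DataFrame({
    "experience_level": ["MI", "SE", "SE"],
    "salary": [70000, 260000, 85000],
    "salary_currency": ["EUR", "USD", "GBP"],
    "salary_in_usd": [79833, 260000, 109024],
    "remote_ratio": [0, 0, 50],
})

# Assumption: drop the target and the columns that encode it before encoding.
features = toy.drop(columns=["salary", "salary_currency", "salary_in_usd"])
encoded = pd.get_dummies(features)

# Numeric columns come first, followed by one indicator column per category.
print(list(encoded.columns))
```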

In [16]:
# Scaling data
scaler = MinMaxScaler()

y = data["salary_in_usd"]
X = data.drop(["salary_in_usd"], axis = 1)
X = scaler.fit_transform(X)
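
The cell above fits the scaler on all rows before the train/test split, so the test rows influence the minimum and maximum used for scaling. A common alternative is to compute the scaling statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that idea with plain NumPy, using the min-max formula `(x - min) / (max - min)` that `MinMaxScaler` applies with its default range; the helper names and toy arrays are assumptions for illustration.

```python
import numpy as np

def fit_min_max(train):
    """Compute per-column minimum and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Avoid division by zero for constant columns.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics that were fitted elsewhere."""
    return (data - col_min) / col_range

# Toy arrays (made up): three training rows and one test row.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

With scikit-learn, the equivalent pattern is to call `fit_transform` on the training split and `transform` on the test split. Test values can fall outside [0, 1], as the second column of the toy test row does.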



<a id="4"></a>
## <div><p style="color:darkmagenta; font-family:Cursive"><b>4. Modeling</p></div>

In [17]:
# Splitting dataset into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [18]:
# Modeling the dataset with different models
def models(train_X, test_X, train_y, test_y):
    gbr = GradientBoostingRegressor()
    knr = KNeighborsRegressor()
    dtr = DecisionTreeRegressor()
    rfr = RandomForestRegressor()

    algorithms = [gbr, knr, dtr, rfr]
    algo_names = ["Gradient Boosting",
                  "K-Neighbors",
                  "Decision Tree",
                  "Random Forest"]

    r_score = []
    rmse = []
    mae = []
    result = pd.DataFrame(columns = ["R_Square", "RMSE", "MAE"], index = algo_names)

    for algo in algorithms:
        pred = algo.fit(train_X, train_y).predict(test_X)
        r_score.append(r2_score(test_y, pred))
        # square root of the mean squared error, i.e. RMSE (in USD)
        rmse.append(mean_squared_error(test_y, pred) ** 0.5)
        mae.append(mean_absolute_error(test_y, pred))

    result["R_Square"] = r_score
    result["RMSE"] = rmse
    result["MAE"] = mae

    return result.sort_values("R_Square", ascending = False)

In [19]:
print("Performance:")
models(train_X, test_X, train_y, test_y)

Performance:


|   | R_Square | RMSE | MAE |
|---|---|---|---|
| Gradient Boosting | 0.973242 | 9304.109381 | 5440.095491 |
| Decision Tree | 0.960994 | 11233.414349 | 4735.207207 |
| Random Forest | 0.944185 | 13437.491197 | 4830.01973 |
| K-Neighbors | 0.668456 | 32750.255502 | 24300.684685 |


<div style="color:DarkSlateGray;text-align:center;border-bottom:2px dotted gray; border-left: 2px dotted gray;border-right: 2px dotted gray">
    <h4>Note</h4>
    <p style="text-align:center;">
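        <span style='font-weight: bold'>Gradient Boosting</span> achieves the highest R-Square (97.32%), although <span style='font-weight: bold'>Decision Tree</span> and <span style='font-weight: bold'>Random Forest</span> have a lower MAE.
        <br>
        The next best-fitting models are <span style='font-weight: bold'>Decision Tree</span> and <span style='font-weight: bold'>Random Forest</span> (R-Square of 96.10% and 94.42%, respectively).
        <br>
        <span style='font-weight: bold'>K-Neighbors</span> performs worst on this dataset (66.85%).
        <br>
        The feature matrix still contains <code>salary</code> and <code>salary_currency</code>, from which <code>salary_in_usd</code> follows almost directly, so these scores likely overstate how well salary can be predicted from job attributes alone.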
    </p>
</div>
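
For readers who want to see what the three reported metrics measure, the following sketch computes them by hand on made-up numbers. In the modeling cell, the second metric column is the square root of `mean_squared_error`, so it is on the same USD scale as the salaries. The arrays below are invented for illustration.

```python
import numpy as np

# Toy targets and predictions (made up).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, same unit as y

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, rmse, r2)
```

Because squaring weights large errors more heavily, a model can rank best on the squared-error metric while another model has the lower MAE, which is the pattern seen in the table above.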