# Individual Final Project
## Student ID: GH1024311
## Christian Jensen
## [URL](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023) for the Dataset

# Problem Statement
## The demand for data science is growing rapidly across many industries and businesses. This creates more opportunities for data scientists such as myself to find a job in almost any field. It also creates a problem: applicants often do not know what salary to expect. Some apply without any idea of the right amount and end up underpaid, while others ask for too much and do not get the job at all. A machine learning model can help answer these questions based on parameters such as job title, the company's geographical location, level of experience, employment type, and company size.
## The company helps people who are looking for jobs. It not only runs mock interviews with its customers but also gives them insight into the salary they can expect. With this machine learning model, the company will be able to give each customer a salary range to ask for from the company they are applying to. The data comes from real people working real jobs at real companies who have shared their salaries and other details, which is what allows the model to predict a salary range.

# Data Exploration and Characteristics
## As will be shown later, the dataset has no missing values; it is a very complete dataset. It also does not suffer from quality issues. The evaluation metrics that best fit a continuous salary target are r2 and mean squared error; because the salary is later binned into categories, the models below are scored with accuracy.
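As a minimal sketch of how r2 and mean squared error are computed with scikit-learn, the snippet below uses made-up salary values. The numbers are illustrative only and are not taken from the dataset.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted salaries (illustrative values only)
y_true = [100, 150, 200]
y_pred = [110, 140, 210]

mse = mean_squared_error(y_true, y_pred)  # average of the squared errors
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot

print(f'MSE: {mse:.2f}')
print(f'R2: {r2:.2f}')
```

Both metrics assume a continuous target, so they apply to the raw `salary_in_usd` values rather than to salary categories.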

# Step 0: Import Useful Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection
import sklearn.compose
import sklearn.preprocessing
import sklearn.svm
from sklearn.metrics import mean_squared_error, r2_score
import sklearn.linear_model
import sklearn.ensemble
from sklearn.model_selection import KFold
from sklearn.utils import resample

# Step 1: Load the .CSV File using 'pd.read_csv()'

In [None]:
df = pd.read_csv('ds_salaries.csv')
display(df.head(5))
df.shape

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


(3755, 11)

In [None]:
# I learned what the category codes above mean here: https://ai-jobs.net/salaries/download/
meanings = {'FT':'Full Time','CT':'Contract','FL':'Freelance','PT':'Part Time',
            'SE':'Senior-Level / Expert','MI':'Mid-Level / Intermediate',
            'EN':'Entry-Level / Junior','EX':'Executive-Level / Director'}
# Created this dictionary to have it handy whenever I'm not sure what one of the codes means.
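A lookup like this can also be applied to the columns directly. The following is a small sketch, assuming a hypothetical two-row frame named `sample` in place of the real data, that translates the codes with `Series.map`.

```python
import pandas as pd

# Code-to-name lookup, restated here so the example is self-contained
meanings = {'FT': 'Full Time', 'CT': 'Contract', 'FL': 'Freelance', 'PT': 'Part Time',
            'SE': 'Senior-Level / Expert', 'MI': 'Mid-Level / Intermediate',
            'EN': 'Entry-Level / Junior', 'EX': 'Executive-Level / Director'}

# Hypothetical two-row frame standing in for the real dataset
sample = pd.DataFrame({'experience_level': ['SE', 'MI'],
                       'employment_type': ['FT', 'CT']})

# Series.map translates each code; codes missing from the dict would become NaN
readable = sample.assign(
    experience_level=sample['experience_level'].map(meanings),
    employment_type=sample['employment_type'].map(meanings),
)
print(readable)
```

This is meant for reading and plotting; the original codes can stay in the frame that is fed to the model.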

In [None]:
df = df.drop(['salary','salary_currency'], axis = 1)
df.head(5)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,CA,100,CA,M


In [None]:
print(df.isnull().sum().sum())
print(df.isna().sum().sum())

0
0


### Adding another column that divides the salaries into 4 different categories.

In [None]:
# The labels describe the bins (0, 50k], (50k, 75k], (75k, 100k] and (100k, inf)
df['salary_category'] = pd.cut(df['salary_in_usd'], bins=[0,50000,75000,100000,np.inf],
                               labels=['< 50k','50k < x <= 75k','75k < x <= 100k','> 100k'])
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,salary_category
0,2023,SE,FT,Principal Data Scientist,85847,ES,100,ES,L,75k < x <= 100k
1,2023,MI,CT,ML Engineer,30000,US,100,US,S,< 50k
2,2023,MI,CT,ML Engineer,25500,US,100,US,S,< 50k
3,2023,SE,FT,Data Scientist,175000,CA,100,CA,M,> 100k
4,2023,SE,FT,Data Scientist,120000,CA,100,CA,M,> 100k


In [None]:
df['salary_category'].value_counts()

salary_category
> 100k             2665
75k < x <= 100k     458
50k < x <= 75k      345
< 50k               287
Name: count, dtype: int64
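By default `pd.cut` builds right-closed intervals, so a salary that sits exactly on an edge belongs to the lower bin. The sketch below illustrates this with made-up salaries and placeholder label names (`bin_1` to `bin_4` are assumptions for this example, not the labels used in the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical salaries chosen to sit on and around the bin edges
salaries = pd.Series([50000, 50001, 75000, 100000, 100001])

# Same edges as in the cell above; the label strings are placeholders
edges = [0, 50000, 75000, 100000, np.inf]
names = ['bin_1', 'bin_2', 'bin_3', 'bin_4']

# right=True (the default) makes each interval (low, high]
binned = pd.cut(salaries, bins=edges, labels=names)
print(binned.tolist())

# Relative class frequencies make any imbalance easy to read
print(binned.value_counts(normalize=True).sort_index())
```

A salary of exactly 50,000 therefore lands in the lowest bin, and 100,000 lands in the third one.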

# Step 2: Splitting the Dataset

In [None]:
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size = 0.2)
print(f'DF size: {df.shape}')
print(f'DF Train size: {df_train.shape}')
print(f'DF Test size: {df_test.shape}')

DF size: (3755, 10)
DF Train size: (3004, 10)
DF Test size: (751, 10)
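Since the salary categories are imbalanced, a stratified split is one option for keeping the class ratios the same in both parts. This is a minimal sketch on a hypothetical frame named `toy`; the `stratify` and `random_state` arguments are not used in the cell above and are shown here only as a possible variation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame: 80 rows of class 'a', 20 rows of class 'b'
toy = pd.DataFrame({'feature': range(100),
                    'label': ['a'] * 80 + ['b'] * 20})

# stratify keeps the 80/20 class ratio in both splits; random_state fixes the shuffle
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['label'], random_state=42
)

print(toy_train['label'].value_counts().to_dict())
print(toy_test['label'].value_counts().to_dict())
```

Fixing `random_state` also makes the split, and every number computed after it, reproducible between runs.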


### One Hot Encoder will be used for the categorical feature columns in Step 4.

### Using 'LabelEncoder()' for the target column 'salary_category'.

In [None]:
Label_Encoder = sklearn.preprocessing.LabelEncoder()
Label_Encoder.fit(df_train['salary_category'])
df_train['salary_category'] = Label_Encoder.transform(df_train['salary_category'])

In [None]:
df_test['salary_category'] = Label_Encoder.transform(df_test['salary_category'])
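It is worth knowing that `LabelEncoder` assigns integer codes in sorted label order, not in the order of the salary ranges. The sketch below shows this with hypothetical labels (`low`, `mid`, `high`) and how `inverse_transform` turns a predicted code back into a readable label.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered labels, used only for illustration
labels = ['low', 'mid', 'high', 'mid', 'low']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted alphabetically, so the codes do not follow low < mid < high
print(list(encoder.classes_))   # ['high', 'low', 'mid']
print(encoded.tolist())         # [1, 2, 0, 2, 1]

# inverse_transform recovers the readable label from an integer code
decoded = encoder.inverse_transform([0, 1, 2])
print(decoded.tolist())         # ['high', 'low', 'mid']
```

For a classifier this ordering does not matter, but it does matter when the codes are read by a person or shown to a customer.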

# Step 3: Data Pre-Processing and Feature Engineering

In [None]:
df_train.dtypes

work_year              int64
experience_level      object
employment_type       object
job_title             object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
salary_category        int64
dtype: object

In [None]:
print(df_train.isnull().sum().sum())
print(df_train.isna().sum().sum())

0
0


In [None]:
# List to hold oversampled data
oversampled_data = []

# Count the rows in each category of the 'salary_category' column
category_counts = df_train['salary_category'].value_counts()

# Get the majority category count
majority_count = category_counts.max()

# Iterate through each category
for category in category_counts.index:
    # Get all rows belonging to this category
    category_data = df_train[df_train['salary_category'] == category]

    # Oversample to match the majority count
    oversampled_category_data = category_data.sample(majority_count, replace=True)

    # Append to the list
    oversampled_data.append(oversampled_category_data)

# Concatenate all the oversampled data to form a balanced DataFrame
balanced_df_train = pd.concat(oversampled_data)

In [None]:
# This is made to shuffle the dataset
balanced_df_train = balanced_df_train.sample(frac=1).reset_index(drop=True)

In [None]:
balanced_df_train['salary_category'].value_counts()

salary_category
1    2125
0    2125
3    2125
2    2125
Name: count, dtype: int64
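The same oversampling idea can be written more compactly and with a fixed seed so that the balanced set is reproducible. This is a sketch on a hypothetical frame named `toy_train`, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical imbalanced training frame: class 0 has 6 rows, class 1 has 2 rows
toy_train = pd.DataFrame({'x': range(8),
                          'target': [0] * 6 + [1] * 2})

majority = toy_train['target'].value_counts().max()

# Sample every class up to the majority size, with replacement
parts = [
    group.sample(majority, replace=True, random_state=0)
    for _, group in toy_train.groupby('target')
]

# Concatenate, then shuffle so the classes are not stored in blocks
balanced = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)
print(balanced['target'].value_counts().to_dict())
```

As in the notebook, only the training split is resampled; the test split keeps its original class distribution.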

In [None]:
# from sklearn.utils import resample

# minority_class_label = ['< 50k','75k < x <= 50k','100k < x <= 75k']
# desired_sample_sizes = [200, 200, 200]

# majority_class = df_train[df_train['salary_category'] == '> 100k']
# minority_classes = [df_train[df_train['salary_category'] == label] for label in minority_class_label]

# resampled_minority_classes = [resample(minority_class, replace = True, n_samples = desired_sample_size,
#                                        random_state = 42) for minority_class, desired_sample_size in zip(minority_classes, desired_sample_sizes)
#                              ]

# df_resampled = pd.concat([majority_class] + resampled_minority_classes)

# df_resampled['salary_category'].value_counts()

In [None]:
# from sklearn.utils import resample
# import pandas as pd

# # Define the minority class labels and their desired sample sizes
# minority_class_labels = ['100k < x <= 75k', '75k < x <= 50k', '< 50k']
# desired_sample_sizes = [200, 200, 200]  # Example values, adjust according to your requirements

# # Assuming majority_class is defined elsewhere
# majority_class = df_train[df_train['salary_category'] == '> 100k']

# # Assuming df_train contains a column 'salary_category' indicating the salary category for each data point
# minority_classes = [df_train[df_train['salary_category'] == label] for label in minority_class_labels]

# # Resample each minority class separately
# resampled_minority_classes = [[resample(minority_class, replace=True, random_state=42) for minority_class in zip(minority_classes)]]

# # Combine majority class and upsampled minority classes
# df_resampled = pd.concat([majority_class] + resampled_minority_classes, ignore_index=True)

# # Check the class distribution after resampling
# print(df_resampled["salary_category"].value_counts())

# Step 4: Prepare the Train and Test Datasets

In [None]:
x_train = balanced_df_train.drop(['salary_in_usd','salary_category'], axis = 1)
y_train = balanced_df_train['salary_category']

x_test = df_test.drop(['salary_in_usd','salary_category'], axis = 1)
y_test = df_test['salary_category']

###########################

print('x_train size: ', x_train.shape)
print('y_train size: ', y_train.shape)

print('x_test size: ', x_test.shape)
print('y_test size: ', y_test.shape)

In [None]:
categorical_attributes = x_train.select_dtypes(include = ['object']).columns.tolist()
numerical_attributes = x_train.select_dtypes(include = ['int64']).columns.tolist()

ct = sklearn.compose.ColumnTransformer([
    ('standard_scaling', sklearn.preprocessing.StandardScaler(), numerical_attributes),
    ('one_hot_encoding', sklearn.preprocessing.OneHotEncoder(handle_unknown = 'ignore'), categorical_attributes),
])

ct.fit(x_train)
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)

In [None]:
# from imblearn.over_sampling import SMOTE

# smote = SMOTE(random_state = 42)

# x_train_resampled, y_train_resampled = smote.fit_resample(x_train,y_train)

In [None]:
print('x_train size: ', x_train.shape)
print('y_train size: ', y_train.shape)

print('x_test size: ', x_test.shape)
print('y_test size: ', y_test.shape)

x_train size:  (8500, 237)
y_train size:  (8500,)
x_test size:  (751, 237)
y_test size:  (751,)


# Step 6: Model Training

# Step 7: Model Assessment

In [None]:
# print(f'SVC accuracy: {accuracySVC}')
# print(f'RFC accuracy: {accuracyRFC}')
# print(f'GBC accuracy: {accuracyGBC}')

# Step 8: HyperParameter / Fine Tuning the Model
- Using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

### We use K-Fold to do cross-validation on the training dataset (x_train)

In [None]:
kf = KFold(n_splits=5)

In [None]:
param_gridSVC = {
    'gamma' : ['scale', 'auto'],
    'decision_function_shape' : ['ovo', 'ovr']
}
Tunning_SVC = sklearn.model_selection.GridSearchCV(estimator =
                                                   sklearn.svm.SVC(),
                                                   param_grid = param_gridSVC,
                                                   cv = kf, scoring = 'accuracy'
                                                   )

new_resultSVC = Tunning_SVC.fit(x_train, y_train)

In [None]:
param_grid1 = {
    'n_estimators' : [50, 100],
    'max_depth' : [5, 10, 30],
    'min_samples_split' : [2, 5],
    'bootstrap' : [True, False],
    'max_features' : ['sqrt', 'log2']
}
Tunning_RFC = sklearn.model_selection.GridSearchCV(estimator =
                                                   sklearn.ensemble.RandomForestClassifier(),
                                                   param_grid = param_grid1,
                                                   cv=3, scoring = 'accuracy'
                                                   )
new_resultRFC = Tunning_RFC.fit(x_train, y_train)

In [None]:
param_grid2 = {
    'n_estimators' : [50, 100],
    'min_samples_split' : [2, 5],
    'max_features' : ['sqrt', 'log2']
}
Tunning_GBC = sklearn.model_selection.GridSearchCV(estimator =
                                                   sklearn.ensemble.GradientBoostingClassifier(),
                                                   param_grid = param_grid2,
                                                   cv=3, scoring = 'accuracy'
                                                   )
new_resultGBC = Tunning_GBC.fit(x_train, y_train)

In [None]:
print(Tunning_SVC.best_estimator_)
print(Tunning_SVC.best_score_)
print(Tunning_RFC.best_estimator_)
print(Tunning_RFC.best_score_)
print(Tunning_GBC.best_estimator_)
print(Tunning_GBC.best_score_)

SVC(decision_function_shape='ovo')
0.73
RandomForestClassifier(max_depth=30, n_estimators=50)
0.7622340558476596
GradientBoostingClassifier(max_features='sqrt')
0.660116558193612


In [None]:
y_predictedRFC = Tunning_RFC.predict(x_test)

In [None]:
new_accuracyRFC = sklearn.metrics.accuracy_score(y_test, y_predictedRFC)

In [None]:
print(f'New RFC accuracy: {new_accuracyRFC}')

New RFC accuracy: 0.6790945406125166
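Accuracy alone can hide how the model behaves on the smaller salary categories, because the test set keeps its original class imbalance. As a sketch with made-up class codes (not the notebook's predictions), a confusion matrix and a per-class report can be produced like this.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted class codes (illustrative values only)
y_true = [3, 3, 3, 3, 2, 1, 0, 0]
y_pred = [3, 3, 3, 2, 3, 1, 0, 3]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(matrix)

# Per-class precision, recall and F1; zero_division=0 avoids warnings for empty classes
report = classification_report(y_true, y_pred, labels=[0, 1, 2, 3], zero_division=0)
print(report)
```

In the notebook the same two calls would take `y_test` and `y_predictedRFC` as arguments.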


# Overall Strengths and Limitations
- The strength of this Machine Learning Model is that it can predict the salary an applicant can expect from parameters that are easy to input and also feasible to obtain, for example the company size and location. The other parameters depend directly on the applicant.
- The weakness of this Model is that it cannot give a specific number; it can only give an interval for the amount of salary the applicant should ask for. The applicant therefore has to work with a salary interval rather than a specific amount.

# Data-driven Recommendations
- Use the insight from the Model to ensure equal and fair salary distribution.
- Work on creating salary negotiation coaching for your customers and use the model to properly predict the salary range of the applicant.
- Integrate the Model into a paid app that presents the salary range for the parameters entered.

# The Most Informative Features

In [None]:
# The forest was fitted on the transformed array, so the encoded column names come from the ColumnTransformer
feature_names = ct.get_feature_names_out()
informative_features = dict(zip(feature_names, Tunning_RFC.best_estimator_.feature_importances_))
informative_features = dict(sorted(informative_features.items(), key = lambda x: x[1], reverse = True))

In [None]:
informative_features

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd '/content/drive/My Drive/Colab Notebooks'

In [None]:
!jupyter nbconvert --to html Final_Assessment_AI_and_ML.ipynb