# Data Science Salary Prediction

## Column Description
1. `job_title`: The job title or role associated with the reported salary.
2. `experience_level`: The experience level of the individual (entry, mid, senior, or executive).
3. `employment_type`: The employment type (full-time, part-time, contract, or freelance).
4. `work_models`: The working model (remote, on-site, or hybrid).
5. `work_year`: The year in which the salary information was recorded.
6. `employee_residence`: The country where the employee resides.
7. `salary`: The reported salary in the original currency.
8. `salary_currency`: The currency in which the salary is denominated.
9. `salary_in_usd`: The salary converted to US dollars.
10. `company_location`: The geographic location of the employing organization.
11. `company_size`: The size of the company, categorized by the number of employees.


Data source: https://www.kaggle.com/datasets/sazidthe1/data-science-salaries
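
Before loading the CSV, it can help to confirm that a frame actually carries the columns described above. The following is a minimal sketch under stated assumptions: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the toy frame stands in for the real dataset.

```python
import pandas as pd

# Column names copied from the description list above.
EXPECTED_COLUMNS = [
    "job_title", "experience_level", "employment_type", "work_models",
    "work_year", "employee_residence", "salary", "salary_currency",
    "salary_in_usd", "company_location", "company_size",
]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Toy frame standing in for the real CSV; it deliberately lacks company_size.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(toy))
```

An empty list means every described column is present.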

In [1]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

salary_df = pd.read_csv('./dataset/data_science_salaries.csv')
salary_df.head()

Unnamed: 0,job_title,experience_level,employment_type,work_models,work_year,employee_residence,salary,salary_currency,salary_in_usd,company_location,company_size
0,Data Engineer,Mid-level,Full-time,Remote,2024,United States,148100,USD,148100,United States,Medium
1,Data Engineer,Mid-level,Full-time,Remote,2024,United States,98700,USD,98700,United States,Medium
2,Data Scientist,Senior-level,Full-time,Remote,2024,United States,140032,USD,140032,United States,Medium
3,Data Scientist,Senior-level,Full-time,Remote,2024,United States,100022,USD,100022,United States,Medium
4,BI Developer,Mid-level,Full-time,On-site,2024,United States,120000,USD,120000,United States,Medium


In [2]:
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11087 entries, 0 to 11086
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   job_title           11087 non-null  object
 1   experience_level    11087 non-null  object
 2   employment_type     11087 non-null  object
 3   work_models         11087 non-null  object
 4   work_year           11087 non-null  int64 
 5   employee_residence  11087 non-null  object
 6   salary              11087 non-null  int64 
 7   salary_currency     11087 non-null  object
 8   salary_in_usd       11087 non-null  int64 
 9   company_location    11087 non-null  object
 10  company_size        11087 non-null  object
dtypes: int64(3), object(8)
memory usage: 952.9+ KB


In [3]:
salary_df.describe()

Unnamed: 0,work_year,salary,salary_in_usd
count,11087.0,11087.0,11087.0
mean,2022.848381,169572.3,149614.977631
std,0.567803,408031.1,66704.329347
min,2020.0,14000.0,15000.0
25%,2023.0,105000.0,104000.0
50%,2023.0,142200.0,142000.0
75%,2023.0,188050.0,185900.0
max,2024.0,30400000.0,750000.0


In [4]:
salary_df.shape

(11087, 11)

In [5]:
# list the unique values of each categorical column
# (the set difference below does not preserve the original column order)
columns = salary_df.columns.to_list()
numerical_columns = ['work_year', 'salary', 'salary_in_usd']
categorical_columns = list(set(columns).difference(numerical_columns))

categorical = salary_df[categorical_columns]
for cat in categorical_columns:
    print(f'Unique Values for {cat.upper()}: \n{salary_df[cat].unique()} \
          \nTotal Unique Values: {len(salary_df[cat].unique())}\n\n')

Unique Values for EMPLOYMENT_TYPE: 
['Full-time' 'Part-time' 'Contract' 'Freelance']           
Total Unique Values: 4


Unique Values for EXPERIENCE_LEVEL: 
['Mid-level' 'Senior-level' 'Entry-level' 'Executive-level']           
Total Unique Values: 4


Unique Values for JOB_TITLE: 
['Data Engineer' 'Data Scientist' 'BI Developer' 'Research Analyst'
 'Business Intelligence Developer' 'Data Analyst'
 'Director of Data Science' 'MLOps Engineer' 'Machine Learning Scientist'
 'Machine Learning Engineer' 'Data Science Manager' 'Applied Scientist'
 'Business Intelligence Analyst' 'Analytics Engineer'
 'Business Intelligence Engineer' 'Data Science' 'Research Scientist'
 'Research Engineer' 'Managing Director Data Science' 'AI Engineer'
 'Data Specialist' 'Data Architect' 'Data Visualization Specialist'
 'ETL Developer' 'Data Science Practitioner' 'Computer Vision Engineer'
 'Data Lead' 'ML Engineer' 'Data Developer' 'Data Modeler'
 'Data Science Consultant' 'AI Architect' 'Data Analytics Ma

In [6]:
# convert categorical columns with fewer than 10 unique values to the category dtype to reduce memory usage
cat_columns_small = [cat for cat in categorical_columns if salary_df[cat].nunique() < 10]
cat_columns_small

['employment_type', 'experience_level', 'company_size', 'work_models']

In [7]:
for cat in cat_columns_small:
    salary_df[cat] = salary_df[cat].astype('category')
    
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11087 entries, 0 to 11086
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   job_title           11087 non-null  object  
 1   experience_level    11087 non-null  category
 2   employment_type     11087 non-null  category
 3   work_models         11087 non-null  category
 4   work_year           11087 non-null  int64   
 5   employee_residence  11087 non-null  object  
 6   salary              11087 non-null  int64   
 7   salary_currency     11087 non-null  object  
 8   salary_in_usd       11087 non-null  int64   
 9   company_location    11087 non-null  object  
 10  company_size        11087 non-null  category
dtypes: category(4), int64(3), object(4)
memory usage: 650.4+ KB
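
The drop from 952.9+ KB to 650.4+ KB comes from storing small integer codes instead of one Python string per row. The `+` in the `info()` output indicates that the object columns are not measured deeply, so the real saving may be larger. A minimal sketch on a toy column (not the notebook's data) that compares the two representations with `memory_usage(deep=True)`:

```python
import pandas as pd

# Toy column with three repeated labels, standing in for a column such as company_size.
labels = pd.Series(["Small", "Medium", "Large"] * 1000, dtype="object", name="company_size")

# deep=True counts the memory held by the Python string objects themselves.
object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The exact byte counts depend on the pandas version and platform, so none are quoted here.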


In [8]:
# clean job titles
# build boolean masks once, from the original titles; the cells below reuse them
# (str.contains treats its pattern as a regular expression by default)
director_filter = salary_df.job_title.str.contains('Director')
principal_filter = salary_df.job_title.str.contains('Principal')
lead_filter = salary_df.job_title.str.contains('Lead')
manager_filter = salary_df.job_title.str.contains('Manager')
head_filter = salary_df.job_title.str.contains('Head')

da_filter = salary_df.job_title.str.contains('Analy')
ds_filter = salary_df.job_title.str.contains('D.* Scien.*')
de_filter = salary_df.job_title.str.contains('Da.* Engineer.*')
ml_filter = salary_df.job_title.str.contains('M.* Engineer.*')

# inspect every original title matched by the analyst mask
salary_df.loc[da_filter, 'job_title'].unique()

array(['Research Analyst', 'Data Analyst',
       'Business Intelligence Analyst', 'Analytics Engineer',
       'Data Analytics Manager', 'Data Quality Analyst',
       'Data Analytics Lead', 'Data Management Analyst', 'BI Analyst',
       'Business Data Analyst', 'Financial Data Analyst',
       'Data Operations Analyst', 'BI Data Analyst',
       'Compliance Data Analyst', 'Business Intelligence Data Analyst',
       'Product Data Analyst', 'Data Visualization Analyst',
       'Finance Data Analyst', 'Lead Data Analyst',
       'Data Analytics Specialist', 'Staff Data Analyst',
       'Insight Analyst', 'Data Analyst Lead',
       'Analytics Engineering Manager', 'Data Analytics Consultant',
       'Data Analytics Engineer', 'Marketing Data Analyst',
       'Principal Data Analyst', 'Sales Data Analyst'], dtype=object)

In [9]:
salary_df.loc[(da_filter) & (~lead_filter) & (~manager_filter), 'job_title'] = 'Data Analyst'
salary_df.loc[(da_filter) & (lead_filter), 'job_title'] = 'Lead Data Analyst'
salary_df.loc[(da_filter) & (manager_filter), 'job_title'] = 'Data Analytics Manager'
salary_df[da_filter].job_title.unique()

array(['Data Analyst', 'Data Analytics Manager', 'Lead Data Analyst'],
      dtype=object)
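
Note that `da_filter` was computed before the renaming, so `salary_df[da_filter]` still selects the rows whose original title matched `'Analy'`. That is why titles such as 'Analytics Engineer' now appear under 'Data Analyst'. A minimal sketch of this behavior on a toy frame (the titles are taken from the output above, while `analy_mask` and `lead_mask` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"job_title": ["BI Analyst", "Analytics Engineer", "Data Engineer", "Data Analytics Lead"]}
)

# Masks are computed once, from the original titles.
analy_mask = toy["job_title"].str.contains("Analy")
lead_mask = toy["job_title"].str.contains("Lead")

toy.loc[analy_mask & ~lead_mask, "job_title"] = "Data Analyst"
toy.loc[analy_mask & lead_mask, "job_title"] = "Lead Data Analyst"

# The stored mask still points at the same three rows after the rename.
print(toy.loc[analy_mask, "job_title"].tolist())
# → ['Data Analyst', 'Data Analyst', 'Lead Data Analyst']
```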

In [10]:
# titles that carry a seniority marker are excluded from the renames below
senior_filter = salary_df.job_title.str.contains('Principal|Lead|Manag|Head|Director')
salary_df.loc[(de_filter) & (~senior_filter), 'job_title'] = 'Data Engineer'
salary_df[de_filter].job_title.unique()

array(['Data Engineer', 'Principal Data Engineer', 'Lead Data Engineer'],
      dtype=object)

In [11]:
salary_df.loc[(ml_filter) & (~senior_filter), 'job_title'] = 'Machine Learning Engineer'
salary_df[ml_filter].job_title.unique()

array(['Machine Learning Engineer', 'Principal Machine Learning Engineer',
       'Lead Machine Learning Engineer'], dtype=object)

In [12]:
salary_df.loc[(ds_filter) & (~senior_filter), 'job_title'] = 'Data Scientist'
salary_df[ds_filter].job_title.unique()

array(['Data Scientist', 'Director of Data Science',
       'Data Science Manager', 'Managing Director Data Science',
       'Data Science Lead', 'Data Science Director',
       'Principal Data Scientist', 'Head of Data Science',
       'Lead Data Scientist', 'Data Science Tech Lead',
       'Data Scientist Lead'], dtype=object)

In [13]:
salary_df.loc[(ds_filter) & (director_filter), 'job_title'] = 'Data Science Director'
salary_df.loc[(ds_filter) & (lead_filter), 'job_title'] = 'Lead Data Scientist'

salary_df[ds_filter].job_title.unique()

array(['Data Scientist', 'Data Science Director', 'Data Science Manager',
       'Lead Data Scientist', 'Principal Data Scientist',
       'Head of Data Science'], dtype=object)

In [14]:
architect_filter = salary_df.job_title.str.contains('Architect')
ai_filter = salary_df.job_title.str.contains('AI')
salary_df.loc[(architect_filter) & (~principal_filter) & (~ai_filter), 'job_title'] = 'Data Architect'
salary_df[architect_filter].job_title.unique()

array(['Data Architect', 'AI Architect', 'Principal Data Architect'],
      dtype=object)
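
The cell-by-cell renaming above could also be written as an ordered rule table, which keeps the precedence between patterns explicit. This is only a sketch under stated assumptions: `RULES` and `normalize_title` are hypothetical names, and the patterns cover just the analyst titles rather than reproducing the notebook's full mapping.

```python
import re

import pandas as pd

# Hypothetical rule table: the first matching pattern wins, so the more
# specific seniority rules must come before the generic 'Analy' rule.
RULES = [
    (r"Analy.*Lead|Lead.*Analy", "Lead Data Analyst"),
    (r"Analy.*Manager", "Data Analytics Manager"),
    (r"Analy", "Data Analyst"),
]


def normalize_title(title):
    """Map a raw title to its consolidated label, or return it unchanged."""
    for pattern, label in RULES:
        if re.search(pattern, title):
            return label
    return title


titles = pd.Series(
    ["BI Analyst", "Data Analyst Lead", "Analytics Engineering Manager", "ETL Developer"]
)
print(titles.map(normalize_title).tolist())
# → ['Data Analyst', 'Lead Data Analyst', 'Data Analytics Manager', 'ETL Developer']
```

Because each title is mapped from its raw value in a single pass, the result does not depend on the order in which earlier cells were run.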

In [15]:
salary_df.job_title.unique()

array(['Data Engineer', 'Data Scientist', 'BI Developer', 'Data Analyst',
       'Business Intelligence Developer', 'Data Science Director',
       'Machine Learning Engineer', 'Machine Learning Scientist',
       'Data Science Manager', 'Applied Scientist',
       'Business Intelligence Engineer', 'Research Scientist',
       'Research Engineer', 'AI Engineer', 'Data Specialist',
       'Data Architect', 'Data Visualization Specialist', 'ETL Developer',
       'Computer Vision Engineer', 'Data Lead', 'Data Developer',
       'Data Modeler', 'AI Architect', 'Data Analytics Manager',
       'Data Product Manager', 'Data Strategist', 'Prompt Engineer',
       'Lead Data Scientist', 'Business Intelligence Manager',
       'Data Manager', 'Lead Data Analyst', 'NLP Engineer',
       'AI Scientist', 'Machine Learning Researcher', 'Head of Data',
       'Machine Learning Modeler', 'Data Integration Specialist',
       'Data Management Specialist', 'AI Developer',
       'Business Intelligence

In [16]:
# continue to clean job titles