# Exploratory Data Analysis on Data Analytics Salaries

⌨ Jeran Burget

This analysis explores salaries of data analytics professionals around the world to find patterns in the data. Specifically, the goal is to determine which factors influence pay rates around the world and learn more about what a career path might look like for somebody starting out in Data Analytics.

## About the data
This data set comes from Kaggle user [randomarnab](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023) and contains information about various roles in data analytics from around the world. The data was gathered in 2023 and contains details about each role's experience level, job title, salary, remote ratio, company location, and company size.

In [1]:
import pandas as pd
df = pd.read_csv('data_analytics_salaries.csv')

## Analysis
The analysis below explores salaries of data analytics professionals. Specifically, it addresses the following questions:

- How does experience level affect salary?
- How does experience level affect remote ratio?
- Which job titles are the most common in the United States and how does the job title affect salary?
- How have salaries changed between 2020 and 2022 for Data Analysts?
- Where are most data analytics positions located (according to this data set)? Which countries pay the most?
- What percent of employees are based in another country but are paid in USD?

One notable aspect of this data set is the presence of both `salary` and `salary_in_usd` columns. The former details the salary for the position in the local currency where the company is based, whereas the latter column standardizes all of the salaries into USD. Thus, this analysis will exclusively use the `salary_in_usd` column for comparisons.
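As a rough sanity check on the relationship between the two columns, the sketch below rebuilds the salary columns of the first five rows of the data set (the `df.head()` output in this notebook) and compares them. The `implied_rate` column name is introduced here for illustration and is not part of the data set.

```python
import pandas as pd

# The first five rows of the data set, restricted to the salary columns.
head = pd.DataFrame({
    "salary": [70000, 260000, 85000, 20000, 150000],
    "salary_currency": ["EUR", "USD", "GBP", "USD", "USD"],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000],
})

# For rows already paid in USD, the two columns should agree.
usd_rows = head[head["salary_currency"] == "USD"]
usd_columns_match = (usd_rows["salary"] == usd_rows["salary_in_usd"]).all()

# For other currencies, the ratio is the conversion rate implied by the data.
head["implied_rate"] = (head["salary_in_usd"] / head["salary"]).round(4)

print(usd_columns_match)
print(head[["salary_currency", "implied_rate"]])
```

Because the implied rates differ by currency, averaging the raw `salary` column would mix units, which is why the comparisons rely on `salary_in_usd`.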

In [2]:
df.head()

,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB


In [4]:
df.describe()

,Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,607.0,607.0,607.0,607.0,607.0
mean,303.0,2021.405272,324000.1,112297.869852,70.92257
std,175.370085,0.692133,1544357.0,70957.259411,40.70913
min,0.0,2020.0,4000.0,2859.0,0.0
25%,151.5,2021.0,70000.0,62726.0,50.0
50%,303.0,2022.0,115000.0,101570.0,100.0
75%,454.5,2022.0,165000.0,150000.0,100.0
max,606.0,2022.0,30400000.0,600000.0,100.0


### How does experience level affect salary?

At first glance, experience level seems to be the obvious candidate for the most influential variable in determining salary for data analytics professionals. This analysis assumes that the experience levels are, in order from least to most experience:

1. EN - Entry level
2. MI - Mid-level
3. SE - Senior level
4. EX - Executive level
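One way to make this assumed ordering explicit in pandas is an ordered categorical, so that grouped results come back in experience order rather than alphabetical order. The sketch below uses a small made-up frame (the salary values are illustrative, not taken from the data set), and the names `toy` and `level_order` are introduced here.

```python
import pandas as pd

# Toy rows for illustration only; these values are not from the data set.
toy = pd.DataFrame({
    "experience_level": ["SE", "EN", "EX", "MI", "EN", "SE"],
    "salary_in_usd": [140000, 60000, 200000, 90000, 64000, 136000],
})

# Encode the assumed ordering so that sorting and grouping follow it.
level_order = ["EN", "MI", "SE", "EX"]
toy["experience_level"] = pd.Categorical(
    toy["experience_level"], categories=level_order, ordered=True
)

# observed=True keeps only the levels that actually appear in the data.
mean_by_level = toy.groupby("experience_level", observed=True)["salary_in_usd"].mean()
print(mean_by_level)
```

With this encoding, the result does not depend on the averages happening to sort in the same order as the experience levels.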

According to the output of the code below, average salary increases with experience level, as hypothesized. However, these figures may be skewed because the data set also includes positions that are not full time. Because part-time workers are more likely to be entry level or mid-level, their lower salaries (a result of working fewer hours) should be excluded for this part of the analysis.

In [149]:
round(df[['experience_level', 'salary_in_usd']].groupby('experience_level').mean().sort_values(by='salary_in_usd'), 0)

experience_level,salary_in_usd
EN,61643.0
MI,87996.0
SE,138617.0
EX,199392.0


The code below creates a subset of the data that contains only full-time positions (`employment_type == 'FT'`). Recalculating the average salary for each experience level on this subset brings the averages only slightly closer together: the entry-level average rises from 61,643 to 64,457 and the executive average falls from 199,392 to 190,728, while the mid-level and senior averages barely move.

In [150]:
full_time_employees = df[df['employment_type'] == 'FT']
round(full_time_employees[['experience_level', 'salary_in_usd']].groupby('experience_level').mean().sort_values(by='salary_in_usd'), 0)

experience_level,salary_in_usd
EN,64457.0
MI,88403.0
SE,139021.0
EX,190728.0


From this analysis, I can conclude that experience is strongly associated with a higher salary. Average salaries differ greatly across experience levels (roughly a threefold gap between entry level and executive), so experience level is likely very influential in determining a person's salary.
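Averages alone can hide small groups and outliers, so it can help to report the group size and the median next to the mean. The sketch below shows one way to do that on a small made-up frame; the values are illustrative and not from the data set, and the names `toy`, `full_time`, and `summary` are introduced here.

```python
import pandas as pd

# Toy rows for illustration only; these values are not from the data set.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EN", "SE", "SE", "SE"],
    "employment_type": ["FT", "FT", "PT", "FT", "FT", "FT"],
    "salary_in_usd": [60000, 70000, 20000, 130000, 150000, 140000],
})

# How many rows the full-time filter will keep and drop.
type_counts = toy["employment_type"].value_counts()

# Keep full-time rows, then report count, mean, and median per level.
full_time = toy[toy["employment_type"] == "FT"]
summary = full_time.groupby("experience_level")["salary_in_usd"].agg(
    ["count", "mean", "median"]
)

print(type_counts)
print(summary)
```

A large gap between mean and median, or a very small count, would be a signal to treat that group's average with caution.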

### How does experience level affect remote ratio?
When considering how much employees are allowed to work remotely, my first guess is that senior employees are given more freedom to work from home than employees with less experience.

However, according to the results of the code below, executives have the highest average remote ratio, exceeding senior employees by about 3 percentage points (2.953297).

Interestingly, entry-level employees have a higher remote ratio than mid-level employees.

In [151]:
round(df[['experience_level', 'remote_ratio']].groupby('experience_level').mean().sort_values('remote_ratio', ascending=False), 1)

experience_level,remote_ratio
EX,78.8
SE,75.9
EN,69.9
MI,63.8


The year 2022 saw an increase in the remote ratio for employees in several different experience levels. Which one saw the biggest increase in average remote ratio?

According to the code below, executives had by far the largest increase in average remote ratio (from 63.6 to 92.3), while mid-level employees actually saw a decrease (from 67.2 to 61.0).

In [152]:
years_2021_and_2022_df = df[(df['work_year'] == 2021) | (df['work_year'] == 2022)]
round(years_2021_and_2022_df[['experience_level', 'work_year', 'remote_ratio']].groupby(['experience_level', 'work_year']).mean(), 1)

experience_level,work_year,remote_ratio
EN,2021,70.2
EN,2022,73.8
EX,2021,63.6
EX,2022,92.3
MI,2021,67.2
MI,2022,61.0
SE,2021,71.7
SE,2022,78.2
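To read the year-over-year change directly instead of comparing rows by eye, the grouped means can be pivoted so that each year becomes a column. The sketch below re-enters the rounded means from the output above and computes the difference; the names `means`, `wide`, and `change` are introduced here for illustration.

```python
import pandas as pd

# Group means copied from the output above (already rounded to one decimal).
means = pd.DataFrame({
    "experience_level": ["EN", "EN", "EX", "EX", "MI", "MI", "SE", "SE"],
    "work_year": [2021, 2022, 2021, 2022, 2021, 2022, 2021, 2022],
    "remote_ratio": [70.2, 73.8, 63.6, 92.3, 67.2, 61.0, 71.7, 78.2],
})

# One row per experience level, one column per year.
wide = means.pivot(index="experience_level", columns="work_year", values="remote_ratio")

# Change in average remote ratio from 2021 to 2022, in percentage points.
wide["change"] = (wide[2022] - wide[2021]).round(1)

print(wide.sort_values("change", ascending=False))
```

Because the inputs are already rounded, the differences are approximate; running the same pivot on the unrounded group means would give more precise values.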


### Which job titles are the most common in the United States and how does the job title affect salary?



The code below shows the most common job titles in the US.

In [133]:
US_df = df[df['company_location'] == "US"]
US_df['job_title'].value_counts()

job_title
Data Engineer                               85
Data Scientist                              84
Data Analyst                                71
Machine Learning Engineer                   16
Data Science Manager                        10
Data Architect                               9
Data Analytics Manager                       7
BI Data Analyst                              5
Machine Learning Scientist                   5
Analytics Engineer                           4
Principal Data Scientist                     4
AI Scientist                                 4
Research Scientist                           4
Head of Data Science                         3
Lead Data Engineer                           3
Data Engineering Manager                     3
Applied Data Scientist                       3
Applied Machine Learning Scientist           3
Principal Data Engineer                      3
Computer Vision Software Engineer            2
Financial Data Analyst                       2
Hea

The code snippet below shows the median salary for data analysts in the United States.

In [156]:
US_data_analyst_df = df[(df['company_location'] == 'US') & (df['job_title'] == 'Data Analyst')]
round(US_data_analyst_df['salary_in_usd'].median(), 0)

106260.0

The code snippet below shows the standard deviation of salaries for machine learning engineers in the United States.

In [154]:
US_machine_learning_engineers_df = df[(df['company_location'] == 'US') & (df['job_title'] == 'Machine Learning Engineer')]
round(US_machine_learning_engineers_df['salary_in_usd'].std(), 0)

44460.0

The code snippet below shows the average salary for data engineers in the United States.

In [155]:
US_data_engineers_df = df[(df['company_location'] == 'US') & (df['job_title'] == 'Data Engineer')]
round(US_data_engineers_df['salary_in_usd'].mean(), 2)

139724.68
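The three snippets above each filter for one title and compute one statistic. As an alternative, a single `groupby` can report the same statistics side by side for several titles. The sketch below shows the idea on a small made-up frame; the salary values are illustrative and not from the data set, and the names `toy`, `titles`, and `title_summary` are introduced here.

```python
import pandas as pd

# Toy rows for illustration only; these values are not from the data set.
toy = pd.DataFrame({
    "company_location": ["US", "US", "US", "US", "US", "DE"],
    "job_title": ["Data Analyst", "Data Analyst", "Data Engineer",
                  "Data Engineer", "Machine Learning Engineer", "Data Analyst"],
    "salary_in_usd": [100000, 110000, 130000, 150000, 160000, 60000],
})

# Restrict to US companies and the titles of interest.
titles = ["Data Analyst", "Data Engineer", "Machine Learning Engineer"]
us_rows = toy[(toy["company_location"] == "US") & (toy["job_title"].isin(titles))]

# One row per title with count, median, mean, and standard deviation.
title_summary = us_rows.groupby("job_title")["salary_in_usd"].agg(
    ["count", "median", "mean", "std"]
)
print(title_summary)
```

Note that the standard deviation is undefined (NaN) for a title with a single row, which is another reason to keep the count in view.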

### How have salaries changed between 2020 and 2022 for Data Analysts?

Average salaries for data analysts as a whole rose substantially over this period: from about $45,500 in 2020 to about $79,500 in 2021, and then to about $100,600 in 2022. The jump in 2021 was the larger of the two.

In [158]:
data_analysts_years_2020_through_2022_df = df[(df['job_title'] == 'Data Analyst') & (df['work_year'] >= 2020) & (df['work_year'] <= 2022)]
round(data_analysts_years_2020_through_2022_df[['work_year', 'salary_in_usd']].groupby('work_year').mean(), 0)

work_year,salary_in_usd
2020,45547.0
2021,79505.0
2022,100551.0
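The size of each yearly change is easier to judge as a percentage. The sketch below re-enters the rounded yearly means from the output above and applies `pct_change`; the names `yearly_mean` and `pct` are introduced here for illustration.

```python
import pandas as pd

# Yearly mean salaries for data analysts, copied from the output above.
yearly_mean = pd.Series(
    [45547.0, 79505.0, 100551.0],
    index=[2020, 2021, 2022],
    name="salary_in_usd",
)

# Percent change relative to the previous year (the first year has no baseline).
pct = (yearly_mean.pct_change() * 100).round(1)
print(pct)
```

On these figures the increase is roughly 75% from 2020 to 2021 and roughly 26% from 2021 to 2022. These are averages over whichever data analyst rows exist for each year, so shifts in the mix of countries or experience levels between years could account for part of the change.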


### Where are most data analytics positions located (according to this data set)? Which countries pay the most?

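A majority of the data analytics positions in this data set are in the US (355 of 607), and the countries with the highest average salaries are Russia, the US, and New Zealand. Note that Russia and New Zealand are represented by only two positions and one position respectively, so their averages are not reliable.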

In [159]:
company_location_and_salary_agg = {'company_location': 'count', 'salary_in_usd': 'mean'}
round(df[['company_location', 'salary_in_usd']].groupby('company_location').agg(company_location_and_salary_agg).sort_values(by='salary_in_usd', ascending=False), 0)

company_location,company_location (count),salary_in_usd (mean)
RU,2,157500.0
US,355,144055.0
NZ,1,125000.0
IL,1,119059.0
JP,6,114127.0
AU,3,108043.0
AE,3,100000.0
DZ,1,100000.0
IQ,1,100000.0
CA,30,99824.0
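Several of the countries at the top of this ranking have only one or two positions. One way to guard against that is to require a minimum number of positions before ranking by average salary. The sketch below re-enters the ten rows shown above; the cutoff of 5 and the names `by_country`, `positions`, `mean_salary_in_usd`, and `min_positions` are assumptions made for illustration.

```python
import pandas as pd

# Rows copied from the output above: position count and mean salary per country.
by_country = pd.DataFrame({
    "company_location": ["RU", "US", "NZ", "IL", "JP", "AU", "AE", "DZ", "IQ", "CA"],
    "positions": [2, 355, 1, 1, 6, 3, 3, 1, 1, 30],
    "mean_salary_in_usd": [157500.0, 144055.0, 125000.0, 119059.0, 114127.0,
                           108043.0, 100000.0, 100000.0, 100000.0, 99824.0],
}).set_index("company_location")

# Assumed cutoff: ignore countries with fewer than 5 positions.
min_positions = 5
well_sampled = by_country[by_country["positions"] >= min_positions].sort_values(
    "mean_salary_in_usd", ascending=False
)
print(well_sampled)
```

Among the rows shown, only the US, Japan, and Canada pass this cutoff. A different threshold would give a different list, so the choice should be stated alongside any ranking.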


### What percent of employees are based in another country but are paid in USD?
This is a tricky one.

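To figure this out, I'll need a filter that keeps only rows where the company is located outside the United States (`company_location != 'US'`) and the salary currency is USD. Then I can divide the number of rows in that dataframe by the number of rows in the original dataframe to get the answer. Note that this filter uses the company's location as a stand-in for where the employee is based; the `employee_residence` column would be the more direct measure.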

According to this data set, 8.24% of employees are based in another country but are paid in USD.

In [173]:
other_country_usd_df = df[(df['company_location'] != 'US') & (df['salary_currency'] == 'USD')]
percentage_other_country_usd_df = (other_country_usd_df['Unnamed: 0'].count() / df['Unnamed: 0'].count()) * 100
round(percentage_other_country_usd_df, 2)

8.24
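The same kind of percentage can be computed without building an intermediate dataframe, because the mean of a boolean mask is the share of rows where it is true. The sketch below applies this to the first five rows of the data set (from the `df.head()` output), so the resulting share describes only those five rows, not the full data set. The names `head`, `mask`, and `share` are introduced here for illustration.

```python
import pandas as pd

# The first five rows of the data set, restricted to the relevant columns.
head = pd.DataFrame({
    "salary_currency": ["EUR", "USD", "GBP", "USD", "USD"],
    "employee_residence": ["DE", "JP", "GB", "HN", "US"],
    "company_location": ["DE", "JP", "GB", "HN", "US"],
})

# True where the company is outside the US and the salary is paid in USD.
mask = (head["company_location"] != "US") & (head["salary_currency"] == "USD")

# The mean of a boolean Series is the fraction of True values.
share = round(mask.mean() * 100, 2)
print(share)
```

Swapping `company_location` for `employee_residence` in the mask would measure where the employee lives instead of where the company is located; in these five rows the two columns are identical, so the result is the same.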

## Conclusion
Salaries for data analytics roles increase with experience and vary by location, with the US being where the majority of jobs in this data set are located.