# Exploratory Data Analysis on Data Analytics Salaries

⌨ Jordan Witzel

This analysis explores salaries of data analytics professionals around the world to find patterns in the data. Specifically, the goal is to determine which factors influence pay and to learn more about what a career path might look like for somebody starting out in data analytics.

## About the data
This data set comes from Kaggle user [randomarnab](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023) and contains information about various roles in data analytics from around the world. The data was gathered in 2023 and contains details about each role's experience level, job title, salary, remote ratio, company location, and company size.

In [None]:
import pandas as pd
df = pd.read_csv('/content/data_analytics_salaries.csv')

## Analysis
The analysis below explores salaries of data analytics professionals. Specifically, it addresses the following questions:

- How does experience level affect salary?
- How does experience level affect remote ratio?
- Which job titles are the most common in the United States and how does the job title affect salary?
- How have salaries changed between 2020 and 2022 for Data Analysts?
- Where are most data analytics positions located (according to this data set)? Which countries pay the most?
- What percent of employees are based in another country but are paid in USD?

One notable aspect of this data set is the presence of both `salary` and `salary_in_usd` columns. The former details the salary for the position in the local currency where the company is based, whereas the latter column standardizes all of the salaries into USD.

###### Thus, this analysis will exclusively use the `salary_in_usd` column for comparisons.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### How does experience level affect salary?
At first glance, experience level appears to be the most influential variable in determining salary for data analytics professionals. This analysis assumes that the experience levels are, in order from least to greatest amount of experience:

1. EN - Entry level
2. MI - Mid-level
3. SE - Senior level
4. EX - Executive level

According to the output of the code below, average salary tends to increase, as hypothesized, as experience level increases. Please note, these figures may be skewed as part-time salaries are included in the data set. Because part-time workers are more likely to be entry-level and mid-level, the lower salaries of these positions (often due to lower hours worked) should be removed in this part of the analysis.

In [None]:
# Experience Level and Avg Salary
df[['experience_level', 'salary_in_usd']].groupby('experience_level').mean().sort_values(by='salary_in_usd').round(2)
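Sorting by mean salary only reproduces the assumed EN, MI, SE, EX order if the averages happen to rise with experience. One way to make the assumed order explicit is an ordered categorical. The sketch below uses a small made-up frame (`sample` and its values are hypothetical, not rows from the Kaggle data set) with the same column names as `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "experience_level": ["SE", "EN", "EX", "MI", "EN", "SE"],
    "salary_in_usd": [140000, 60000, 200000, 90000, 70000, 160000],
})

# Encode the assumed ordering EN < MI < SE < EX explicitly.
level_order = ["EN", "MI", "SE", "EX"]
sample["experience_level"] = pd.Categorical(
    sample["experience_level"], categories=level_order, ordered=True
)

# Grouping on an ordered categorical returns rows in category order.
# observed=True keeps only levels that actually appear in the data.
avg_by_level = (
    sample.groupby("experience_level", observed=True)["salary_in_usd"]
    .mean()
    .round(2)
)
print(avg_by_level)
```

With this encoding the result is listed by experience level rather than by salary, so any level that breaks the expected pattern is easy to spot.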

Below is a subset of the data that contains only full-time positions. Recalculating the average salary on this subset brought the averages for the experience levels slightly closer together, but the change was minimal.

In [None]:
#FT, Entry-level
round(df[(df['employment_type'] == 'FT')& (df['experience_level'] == 'EN')]['salary_in_usd'].mean(),0)
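The cell above computes the full-time average for entry-level roles only. A minimal sketch of the same comparison for every experience level at once, again on a made-up frame (`sample` and its values are hypothetical; the column names match `df`):

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "experience_level": ["EN", "EN", "MI", "MI", "SE", "EX"],
    "employment_type": ["FT", "PT", "FT", "PT", "FT", "FT"],
    "salary_in_usd": [70000, 30000, 100000, 50000, 150000, 200000],
})

# Average salary per level across all employment types.
all_avg = sample.groupby("experience_level")["salary_in_usd"].mean()

# Average salary per level for full-time ('FT') rows only.
ft_avg = (
    sample[sample["employment_type"] == "FT"]
    .groupby("experience_level")["salary_in_usd"]
    .mean()
)

# Put both side by side and show how much the FT-only filter shifts each average.
comparison = pd.DataFrame({"all_types": all_avg, "full_time_only": ft_avg})
comparison["difference"] = comparison["full_time_only"] - comparison["all_types"]
print(comparison.round(0))
```

A small `difference` column for every level would support the claim that part-time salaries do not change the picture much.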

From this analysis, I can conclude that more experience is associated with a higher salary. Average salaries vary greatly across experience levels, which suggests that experience level is very influential in determining salary.

### How does experience level affect remote ratio?
When determining how much employees are allowed to work remotely, it's often assumed that the higher the experience level, the higher the remote ratio.

However, the code below shows us that Mid-level employees have the lowest remote ratio, currently at 63.8%, and even saw a 6% decrease from 2021 to 2022.

Entry, Executive, and Senior levels did have an increase in remote ratio for the year 2022. Executive employees saw the largest increase at about 28%, compared to just 4% for Entry-level employees.

In [None]:
round(df.groupby('experience_level')['remote_ratio'].mean(), 1).sort_values(ascending=False)

In [None]:
#2022 remote increase compared to 2021
wy21 = df[df['work_year'] == 2021 ]
wy22 = df[df['work_year'] == 2022 ]

In [None]:
print(wy21.groupby('experience_level')['remote_ratio'].mean().round(0))

print(wy22.groupby('experience_level')['remote_ratio'].mean().round(0))
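The two printouts above have to be compared by eye. As a sketch, a pivot table can put both years side by side and compute the change directly, in percentage points and as a relative percent change (two different readings of "a 6% decrease"). The frame `sample` and its values are hypothetical; the column names match `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "work_year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "experience_level": ["EN", "EN", "MI", "MI", "EN", "EN", "MI", "MI"],
    "remote_ratio": [50, 100, 100, 100, 100, 100, 50, 100],
})

# Mean remote ratio per experience level (rows) and year (columns).
by_year = sample.pivot_table(
    index="experience_level",
    columns="work_year",
    values="remote_ratio",
    aggfunc="mean",
)

# Change in percentage points, and relative change as a percent of the 2021 value.
by_year["change_pts"] = by_year[2022] - by_year[2021]
by_year["change_pct"] = (by_year["change_pts"] / by_year[2021] * 100).round(1)
print(by_year)
```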

### Which job titles are the most common in the United States and how do they affect salary?

In the US, Data Engineer, Data Scientist, and Data Analyst are among the most common job titles. These titles cover a wide variety of jobs in data-driven fields, but each is a distinct role whose salary varies with experience and location.



In [None]:
# Most common job titles at US-based companies
df[df['company_location'] == 'US']['job_title'].value_counts().head(7)

### How have salaries changed between 2020 and 2022 for Data Analysts?
In 2020, the average Data Analyst salary was 45,547.29 USD.
By 2022, the average had risen to 100,550.74 USD, an increase of just over 55,000 USD.

In [None]:
DA20 = df[(df['job_title'] == 'Data Analyst') & (df['work_year'] == 2020)]['salary_in_usd'].mean().round(2)

In [None]:
DA21 = df[(df['job_title'] == 'Data Analyst') & (df['work_year'] == 2021)]['salary_in_usd'].mean().round(2)

In [None]:
DA22 = df[(df['job_title'] == 'Data Analyst') & (df['work_year'] == 2022)]['salary_in_usd'].mean().round(2)

In [None]:
print(DA20)
print(round(DA21, 1))
print(DA22)
print("salary change (2020 to 2022):", round(DA22-DA20, 2))
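The three per-year cells above repeat the same filter. A minimal sketch of the same calculation with a single `groupby`, plus the relative change, on a made-up frame (`sample` and its values are hypothetical; the column names match `df`):

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "job_title": ["Data Analyst"] * 5 + ["Data Engineer"],
    "work_year": [2020, 2020, 2021, 2022, 2022, 2022],
    "salary_in_usd": [40000, 50000, 80000, 90000, 110000, 150000],
})

# Average Data Analyst salary for every year in one pass.
analysts = sample[sample["job_title"] == "Data Analyst"]
yearly_avg = analysts.groupby("work_year")["salary_in_usd"].mean().round(2)
print(yearly_avg)

# Absolute and relative change between the first and last year of interest.
change = yearly_avg.loc[2022] - yearly_avg.loc[2020]
pct_change = round(change / yearly_avg.loc[2020] * 100, 1)
print("salary change (2020 to 2022):", change, f"({pct_change}%)")
```

Applied to the figures quoted above (45,547.29 and 100,550.74), the same arithmetic gives a difference of 55,003.45, or roughly a 121% increase.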

#### Additional insights from the data below:
- The median salary (USD) for data analysts in the United States is 106260, and the average is 107674.2. Because the two values are close, the distribution is probably not heavily skewed by outliers.
- The standard deviation of salaries (USD) for Machine Learning Engineers in the US is 44,459, which suggests a wide spread of salaries for this title.
- The average salary (USD) for Data Engineers in the US is 139,724.7.



In [None]:
# median salary (USD) for data analysts in the United States
df[(df['job_title'] == 'Data Analyst') & (df['company_location'] == 'US')]['salary_in_usd'].median()

In [None]:
# standard deviation for salaries (USD) for machine learning engineers in the United States

round(df[(df['job_title'] == 'Machine Learning Engineer') & (df['company_location'] == 'US')]['salary_in_usd'].std(), 2)

In [None]:
# Average salary (USD) for data engineers in the United States, rounded to one decimal place.
DE_avgSal = round(df[(df['job_title'] == 'Data Engineer') & (df['company_location'] == 'US')]['salary_in_usd'].mean(), 1)
print(DE_avgSal)

In [None]:
# Average salary (USD) for data scientists in the United States, rounded to one decimal place.
DS_avgSal = round(df[(df['job_title'] == 'Data Scientist') & (df['company_location'] == 'US')]['salary_in_usd'].mean(), 1)
print(DS_avgSal)

In [None]:
print('Data Engineer and Data Scientist salary difference: $' , DS_avgSal - DE_avgSal)
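The per-title statistics above are each computed in a separate cell. As a sketch, a single `groupby(...).agg(...)` can produce the count, mean, median, and standard deviation for every US job title at once. The frame `sample` and its values are hypothetical; the column names match `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Data Engineer",
                  "Data Engineer", "Data Scientist", "Data Analyst"],
    "company_location": ["US", "US", "US", "US", "US", "CA"],
    "salary_in_usd": [100000, 110000, 130000, 150000, 145000, 70000],
})

# Restrict to US-based companies, then summarize salary per job title.
us = sample[sample["company_location"] == "US"]
summary = (
    us.groupby("job_title")["salary_in_usd"]
    .agg(["count", "mean", "median", "std"])
    .round(1)
)
print(summary)

# Difference between two titles, read straight from the summary table.
diff = summary.loc["Data Scientist", "mean"] - summary.loc["Data Engineer", "mean"]
print("Data Scientist minus Data Engineer average salary: $", diff)
```

Titles with a single row show `NaN` for `std`, which is a useful reminder that some averages rest on very few observations.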

### According to this data set, the most Data Analyst positions are in the US.

There are 71 Data Analyst positions in the US, far more than in any other country, and the US is also the highest paying. CA is the runner-up with 9 Data Analyst positions.

If we broaden our job search to any 'Data' position, the US is still the leader with 311 positions, but we see much more variety in other locations, such as 43 positions in GB and 23 in CA.

In [None]:
df[df['job_title'] == 'Data Analyst']['company_location'].value_counts()

In [None]:
DA = df[df['job_title'] == 'Data Analyst']

In [None]:
DA.groupby('company_location')['salary_in_usd'].mean().sort_values(ascending= False)

In [None]:
# Top 3 countries for number of data positions: US, GB, CA
df[df['job_title'].str.contains('Data')]['company_location'].value_counts().head()

### What percent of employees are based in another country but are paid in USD?
This is a tricky one.

Here I've made a filter that keeps only rows where the company is located outside the US and the salary currency is USD. Dividing the number of rows in this filtered dataframe by the number of rows in the original dataframe shows that 8.24 percent of employees work for a company based in another country and are paid in USD.

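Across the full data set (the counts below are not restricted to the USD filter), 30 positions are at companies in CA and 24 are in IN. Salaries outside the US can still be decent; for example, the average salary in CN is 71665.5 USD.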

In [None]:
df.columns

In [None]:
df.head()

In [None]:
#What percent of employees work in a company that is based in another country but are paid in USD? 8.24%

OutsideUSD = df[(df['salary_currency'] == 'USD') & (df['company_location'] != 'US')].sort_values(by='company_location')

In [None]:
OutsideUSD.head()

In [None]:
round(len(OutsideUSD) / len(df) * 100, 2)
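An equivalent way to get this percentage is to take the mean of the boolean filter itself, since `True` counts as 1 and `False` as 0. The same mask can also break the filtered rows down by country. This is a sketch on a made-up frame (`sample` and its values are hypothetical; the column names match `df`).

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "salary_currency": ["USD", "USD", "EUR", "USD", "GBP"],
    "company_location": ["US", "CA", "DE", "IN", "GB"],
})

# True where the company is outside the US but the salary is paid in USD.
mask = (sample["salary_currency"] == "USD") & (sample["company_location"] != "US")

# The mean of a boolean Series is the fraction of True values.
share_pct = round(mask.mean() * 100, 2)
print(share_pct)

# Country breakdown restricted to the rows that pass the filter.
outside_counts = sample.loc[mask, "company_location"].value_counts()
print(outside_counts)
```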

In [None]:
# Which country has 30 job openings available? CA
location_counts = df['company_location'].value_counts()
location_counts[location_counts == 30]

In [None]:
df[df['company_location'] == 'IN'].shape[0]

In [None]:
#Avg USD salary for people working in China:
CN = df[(df['company_location'] == 'CN')]
CN['salary_in_usd'].mean()

In [None]:
# Highest-paying 'Data' jobs by USD salary (all company locations; no US filter is applied here)
df[df['job_title'].str.contains('Data')][['job_title' , 'company_location', 'salary_in_usd']].sort_values(by='salary_in_usd' , ascending= False).head(6)

In [None]:
#Data Jobs outside US by usd Salary.
OutsideUSD[OutsideUSD['job_title'].str.contains('Data')][['job_title' , 'company_location', 'salary_in_usd']].sort_values(by='salary_in_usd' , ascending= False).head(60)

In [None]:
data_jobs= df[df['job_title'].str.contains('Data')]

In [None]:
entry_level = data_jobs[data_jobs['experience_level'] == 'EN']

entry_level.groupby('company_location').agg(
    job_count=('job_title', 'count'),
    avg_salary=('salary_in_usd', 'mean')
).sort_values(by='job_count', ascending=False)
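Averages for US 'Data' jobs overall and for entry-level US 'Data' jobs can be computed with the same kind of filter. The sketch below shows one way to do it on a made-up frame; `sample` and its values are hypothetical (the column names match `df`), and this is not necessarily how the figures in this notebook were produced.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Kaggle data set.
sample = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Engineer", "Data Scientist",
                  "ML Engineer", "Data Analyst"],
    "company_location": ["US", "US", "US", "US", "GB"],
    "experience_level": ["EN", "SE", "EN", "SE", "EN"],
    "salary_in_usd": [80000, 150000, 90000, 170000, 50000],
})

# 'Data' jobs at US-based companies.
us_data = sample[
    sample["job_title"].str.contains("Data")
    & (sample["company_location"] == "US")
]

# Average salary across all US 'Data' jobs.
us_avg = round(us_data["salary_in_usd"].mean(), 0)

# Average salary for entry-level ('EN') US 'Data' jobs only.
us_entry_avg = round(
    us_data.loc[us_data["experience_level"] == "EN", "salary_in_usd"].mean(), 0
)
print(us_avg, us_entry_avg)
```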

## Conclusion
Based on the data above, the chances of landing a job in data appear better in the US, which has a far higher number of positions and a wider variety of position types.
Data jobs in the US have an average salary of 142,157 USD, and entry-level 'Data' positions in the US average 82,309 USD, which is pretty decent for starting out. If not the US, CA or GB could also be a great fit based on the data above.  

Experience also plays a large role. More experience comes with a higher salary. However, there are other factors to consider: if you're looking for remote-focused roles, Mid-level positions might not fit your needs.

Overall, a job in the data-driven world can take you around the globe with a decent salary and a fair range of options and experience levels.