# John Sreenan STEM Compensation Analysis

## Step 1: Load libraries

Load the libraries you need. You will probably want to load pandas and altair (or matplotlib). 

In [56]:
import pandas as pd
import numpy as np
import altair as alt

# Kaggle Data Science and STEM Salaries



## Background

This Kaggle dataset contains about 62,000 salary records from top companies. The data was scraped from levels.fyi and cleaned before this analysis. As a data scientist, I find this dataset very interesting because it provides information that is highly relevant to job searching and choosing a career. Since I have only 1.5 years left in my college education, I think this analysis will help me identify possible career paths in the data science and STEM job market. It should give me insight into which roles to look for and help me find companies that are looking for people with my skill set.

## Research Question

    How do compensation packages of a Data Scientist compare to other STEM fields?


### Link
https://www.kaggle.com/jackogozaly/data-science-and-stem-salaries

## Step 2: Read in the data

Download the data and load it in your notebook in a dataframe called `df`

In [57]:
# Load the levels.fyi salary data, then work on a copy so `df` stays untouched.
df = pd.read_csv('Levels_Fyi_Salary_Data.csv')
data = df.copy()

## Step 3: Investigate the data

In [58]:
data.head()

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000,...,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0,...,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000,...,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000,...,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0,...,0,0,0,0,0,0,0,0,,


In [59]:
data.tail()

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
62637,9/9/2018 11:52,Google,T4,Software Engineer,327000,"Seattle, WA",10.0,1.0,Distributed Systems (Back-End),155000,...,0,0,0,0,0,0,0,0,,
62638,9/13/2018 8:23,Microsoft,62,Software Engineer,237000,"Redmond, WA",2.0,2.0,Full Stack,146900,...,0,0,0,0,0,0,0,0,,
62639,9/13/2018 14:35,MSFT,63,Software Engineer,220000,"Seattle, WA",14.0,12.0,Full Stack,157000,...,0,0,0,0,0,0,0,0,,
62640,9/16/2018 16:10,Salesforce,Lead MTS,Software Engineer,280000,"San Francisco, CA",8.0,4.0,iOS,194688,...,0,0,0,0,0,0,0,0,,
62641,1/29/2019 5:12,apple,ict3,Software Engineer,200000,"Sunnyvale, CA",0.0,0.0,ML / AI,155000,...,0,0,0,0,0,0,0,0,,


In [60]:
data.dtypes

timestamp                   object
company                     object
level                       object
title                       object
totalyearlycompensation      int64
location                    object
yearsofexperience          float64
yearsatcompany             float64
tag                         object
basesalary                   int64
stockgrantvalue            float64
bonus                      float64
gender                      object
otherdetails                object
cityid                       int64
dmaid                      float64
rowNumber                    int64
Masters_Degree               int64
Bachelors_Degree             int64
Doctorate_Degree             int64
Highschool                   int64
Some_College                 int64
Race_Asian                   int64
Race_White                   int64
Race_Two_Or_More             int64
Race_Black                   int64
Race_Hispanic                int64
Race                        object
Education           

# Missing Data

In [61]:
data.isna().sum()

timestamp                      0
company                        5
level                        119
title                          0
totalyearlycompensation        0
location                       0
yearsofexperience              0
yearsatcompany                 0
tag                          854
basesalary                     0
stockgrantvalue                0
bonus                          0
gender                     19540
otherdetails               22505
cityid                         0
dmaid                          2
rowNumber                      0
Masters_Degree                 0
Bachelors_Degree               0
Doctorate_Degree               0
Highschool                     0
Some_College                   0
Race_Asian                     0
Race_White                     0
Race_Two_Or_More               0
Race_Black                     0
Race_Hispanic                  0
Race                       40215
Education                  32272
dtype: int64
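
Raw missing counts are easier to judge as a share of all rows, especially when deciding whether a column such as `gender` or `Race` is complete enough to use. The sketch below shows one way to compute that share; the `toy` frame is a made-up stand-in for `data`, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`; swap in the real dataframe in the notebook.
toy = pd.DataFrame({
    "title": ["Data Scientist", "Sales", "Software Engineer", "Recruiter"],
    "gender": ["Male", None, "Female", None],
    "Race": [None, None, "Asian", None],
})

# isna() marks missing cells; the mean of a boolean column is the fraction that is True.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Show the result as percentages, most incomplete column first.
print((missing_share * 100).round(1))
```

Applied to the real data, `data.isna().mean()` would give the same kind of per-column fraction.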

In [62]:
data.columns

Index(['timestamp', 'company', 'level', 'title', 'totalyearlycompensation',
       'location', 'yearsofexperience', 'yearsatcompany', 'tag', 'basesalary',
       'stockgrantvalue', 'bonus', 'gender', 'otherdetails', 'cityid', 'dmaid',
       'rowNumber', 'Masters_Degree', 'Bachelors_Degree', 'Doctorate_Degree',
       'Highschool', 'Some_College', 'Race_Asian', 'Race_White',
       'Race_Two_Or_More', 'Race_Black', 'Race_Hispanic', 'Race', 'Education'],
      dtype='object')

In [63]:
# data.drop(['gender','tag','otherdetails','Race'], axis=1)

The first thing I wanted to investigate before looking at compensation packages was how common each job title is in this dataset. From the output below, you can see that Software Engineer is by far the most common title.

In [64]:
data["title"].value_counts()

Software Engineer               41231
Product Manager                  4673
Software Engineering Manager     3569
Data Scientist                   2578
Hardware Engineer                2200
Product Designer                 1516
Technical Program Manager        1381
Solution Architect               1157
Management Consultant             976
Business Analyst                  885
Marketing                         710
Mechanical Engineer               490
Sales                             461
Recruiter                         451
Human Resources                   364
Name: title, dtype: int64
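
To put these counts in proportion, the sketch below rebuilds them from the output above and converts them to shares of all records. On the real dataframe, `data["title"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Title counts copied from the value_counts() output above.
counts = pd.Series({
    "Software Engineer": 41231,
    "Product Manager": 4673,
    "Software Engineering Manager": 3569,
    "Data Scientist": 2578,
    "Hardware Engineer": 2200,
    "Product Designer": 1516,
    "Technical Program Manager": 1381,
    "Solution Architect": 1157,
    "Management Consultant": 976,
    "Business Analyst": 885,
    "Marketing": 710,
    "Mechanical Engineer": 490,
    "Sales": 461,
    "Recruiter": 451,
    "Human Resources": 364,
})

# Each title's share of all records, as a fraction between 0 and 1.
share = counts / counts.sum()
print((share * 100).round(1))
```

Software Engineer accounts for roughly two thirds of the records, while Data Scientist makes up about 4%, so averages for the smaller titles rest on far fewer rows.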

I wanted to create a new dataframe with all of the features that I considered to be related to job compensation packages. From this, I will be able to compare the averages of each feature for every job position.

In [65]:
data2 = data[["title","totalyearlycompensation","stockgrantvalue","basesalary"]]
data2

Unnamed: 0,title,totalyearlycompensation,stockgrantvalue,basesalary
0,Product Manager,127000,20000.0,107000
1,Software Engineer,100000,0.0,0
2,Product Manager,310000,0.0,155000
3,Software Engineering Manager,372000,180000.0,157000
4,Software Engineer,157000,0.0,0
...,...,...,...,...
62637,Software Engineer,327000,150000.0,155000
62638,Software Engineer,237000,73200.0,146900
62639,Software Engineer,220000,25000.0,157000
62640,Software Engineer,280000,57000.0,194688


In [66]:
rows, columns = data.shape
print("There are",rows,"rows and",columns,"columns")

There are 62642 rows and 29 columns


### What are the differences between Total Yearly Compensation, Stock Grant value, and Base Salary? What do they mean?


According to [Glassdoor](https://www.glassdoor.com/blog/guide/differences-between-total-compensation-and-base-salary/),

1. Total Yearly Compensation

"Total compensation refers to the entirety of benefit offerings that an employer provides to their employees in exchange for work." This will include your base salary, paid vacation, holidays, sick days, insurance, 401K and many other benefits. This will give me the big picture of a compensation package.

2. Stock Grant value

Stock grants are a way for companies to encourage employees to stay for a set period of time. For example, a company might promise 100 shares of company stock to be paid out over 3 years. This is helpful to look at because stock grants have a number of advantages. One advantage is that granted stock always has some value, because the employee did not spend money on it. Another is that if the company's stock price rises over those three years, the grant becomes worth more.

3. Base Salary

"Base salary refers to the initial rate of compensation that you get as an employee in exchange for performing your job’s duties and responsibilities." This can be thought of as your fixed rate of pay before any extras. Base salary does not include other forms of compensation such as bonus or overtime. I think this is still very helpful to look at because it shows the overall industry standard for how much your job duties and responsibilities are worth.

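One way to see how the three columns relate is to check whether total compensation equals the sum of its parts. The sketch below assumes, as a hypothesis to test rather than a documented fact about this dataset, that `totalyearlycompensation` should be close to `basesalary + stockgrantvalue + bonus`. The rows in `toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names; values are made up.
toy = pd.DataFrame({
    "totalyearlycompensation": [127000, 310000, 200000],
    "basesalary": [107000, 155000, 150000],
    "stockgrantvalue": [20000.0, 100000.0, 30000.0],
    "bonus": [0.0, 55000.0, 10000.0],
})

# Sum the three components row by row.
components = toy[["basesalary", "stockgrantvalue", "bonus"]].sum(axis=1)

# A gap of zero means the total is fully explained by the components.
toy["gap"] = toy["totalyearlycompensation"] - components

# Share of rows where the total differs from the components by more than $1,000.
mismatch_share = (toy["gap"].abs() > 1000).mean()
print(toy[["totalyearlycompensation", "gap"]])
print("Share of mismatched rows:", round(mismatch_share, 2))
```

A large mismatch share on the real data would suggest that some components were left blank or recorded as zero.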

In [67]:
# Boolean masks that pull out the rows for each job title of interest.
boolean_sub_ds = (data2["title"] == "Data Scientist")
ds_df = data2[boolean_sub_ds]

# Note: despite the "se" name, this subset holds Software Engineering Manager rows.
boolean_sub_se = (data2["title"] == "Software Engineering Manager")
se_df = data2[boolean_sub_se]

boolean_sub_sales = (data2["title"] == "Sales")
sales_df = data2[boolean_sub_sales]

## Step 4: Validate the data against another source

Data does not emerge from nowhere. It is deliberately collected and stored as part of social processes. Find some aspect of your data and validate it against another source. You will need to write some code and do some research for this. For instance, when we investigated the baseball database we found that certain players had played in dozens of all star games. We could validate that data by checking Wikipedia (another source) to see if the data seemed to accurately represent baseball history. You are highly encouraged to be creative in doing validation. For instance, if you are looking at data about high school graduation rates you might need to compute the mean graduation rate from your dataset and compare it to official graduation statistics provided by school district leaders in news articles. Do the statistics match? Are they close? What might account for discrepancies? To answer this question, you should write code and describe your research, linking to sources where appropriate.

https://www.salary.com/research/salary/benchmark/software-engineering-manager-salary - Average SEM Base Salary - 143,366

https://www.glassdoor.com/Salaries/manager-of-software-engineering-salary-SRCH_KO0,31.htm - Average SEM Base Salary - 144,311

https://www.payscale.com/research/US/Job=Software_Engineering_Manager/Salary - Average SEM Base Salary - 141,465

**Average: $143,047**

https://www.salary.com/research/salary/listing/data-scientist-salary - Average Data Scientist Base Salary - 134,352

https://www.glassdoor.com/Salaries/data-scientist-salary-SRCH_KO0,14.htm - Average Data Scientist Base Salary - 117,212

https://www.payscale.com/research/US/Job=Data_Scientist/Salary - Average Data Scientist Base Salary - 95,565

**Average: $115,710**

In [68]:

print("The Average Base Salary for a Software Engineering Manager is ${:,}".format(round(np.mean(se_df["basesalary"]))))

The Average Base Salary for a Software Engineering Manager is $174,204


In [69]:
print("The Average Base Salary for a Data Scientist is ${:,}".format(round(np.mean(ds_df["basesalary"]))))

The Average Base Salary for a Data Scientist is $138,055
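
The first rows shown by `data.head()` include entries with a `basesalary` of 0 next to a six-figure total compensation, which suggests that 0 may mean "not reported" rather than an actual salary of zero. Under that assumption, the sketch below compares the mean with and without the zeros. The `toy` values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical Data Scientist rows; two of them have no base salary recorded.
toy = pd.DataFrame({
    "title": ["Data Scientist"] * 4,
    "basesalary": [0, 120000, 140000, 0],
})

# Mean with zeros included: the zeros pull the average down.
raw_mean = toy["basesalary"].mean()

# Treat 0 as missing (an assumption), then average; NaN is skipped by default.
clean_mean = toy["basesalary"].replace(0, np.nan).mean()

print("Mean with zeros:   ${:,.0f}".format(raw_mean))
print("Mean without zeros: ${:,.0f}".format(clean_mean))
```

If many real rows have a zero base salary, the averages printed above would understate typical base pay.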


In [70]:
data["company"].value_counts()

Amazon                      8126
Microsoft                   5216
Google                      4330
Facebook                    2990
Apple                       2028
                            ... 
Samsung research America       1
Bny Mellon                     1
yelp                           1
Bloomberg lp                   1
tableau software               1
Name: company, Length: 1631, dtype: int64

I validated my data by checking the base salaries of Software Engineering Managers and Data Scientists because these two titles are central to my research question. For both jobs, the Kaggle dataset reports noticeably higher base salaries than the external sources: about $174,000 versus $143,000 for Software Engineering Managers (roughly 22% higher) and about $138,000 versus $116,000 for Data Scientists (roughly 19% higher). The Kaggle dataset was scraped from levels.fyi, and while I was on their website I noticed that much of their information is location specific. The company counts above are also dominated by large tech companies such as Amazon, Microsoft, and Google, which may push the averages up. I would be curious to know whether the dataset includes all US and international jobs or only certain regions.
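
The comparison between the external sources and the dataset can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in this step: the three external averages per title and the two dataset means printed above.

```python
# External base salary figures quoted above (salary.com, Glassdoor, PayScale).
benchmarks = {
    "Software Engineering Manager": [143366, 144311, 141465],
    "Data Scientist": [134352, 117212, 95565],
}

# Mean base salaries printed from the Kaggle dataset.
dataset_means = {
    "Software Engineering Manager": 174204,
    "Data Scientist": 138055,
}

results = {}
for title, values in benchmarks.items():
    external = sum(values) / len(values)   # average of the three sources
    gap = dataset_means[title] - external  # how much higher the dataset is
    pct = 100 * gap / external             # gap relative to the external average
    results[title] = (external, gap, pct)
    print("{}: external ${:,.0f}, dataset ${:,}, gap ${:,.0f} ({:.1f}%)".format(
        title, external, dataset_means[title], gap, pct))
```

Expressing the gap as a percentage makes the two titles comparable even though their salary levels differ.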

## Step 5: Make a plot

In this step, you should make some plot to investigate the data. For instance, while exploring the resume experiment in-class, we made a histogram to investigate the distribution of years of experience among the two groups in the experiment. What plot would help you understand your dataset? Any plot that illuminates something interesting about your dataset will be OK -- your focus should be less on getting the plot to appear on screen than on visually representing some interesting phenomenon in the data. 

Be sure to consider the aesthetics of your plot. For instance, do the plot axes show all of the relevant data? Can people read the labels on your plot? You might want to search for style guides or best practices for data visualization for things to consider for your particular kind of plot.

In [71]:
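# Build one bar chart per compensation measure, plus a red rule at the overall mean.
# Note: data2.head(5000) keeps only the first 5,000 of the 62,642 rows,
# so the means drawn in these charts can differ from the full-data averages.
test = np.mean(sales_df["basesalary"])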

chart_comp = alt.Chart(data2.head(5000)).mark_bar().encode(
    x=alt.X('title', axis=alt.Axis(title='Job Title')),
    y=alt.Y("mean(totalyearlycompensation)", axis=alt.Axis(title='Average Yearly Compensation'))
)

chart_stock = alt.Chart(data2.head(5000)).mark_bar().encode(
    x=alt.X('title', axis=alt.Axis(title='Job Title')),
    y=alt.Y("mean(stockgrantvalue)", axis=alt.Axis(title='Average Stock Grant Value'))
)
chart_base = alt.Chart(data2.head(5000)).mark_bar().encode(
    x=alt.X('title', axis=alt.Axis(title='Job Title')),
    y=alt.Y("mean(basesalary)", axis=alt.Axis(title='Average Base Salary'))
)

rule1 = alt.Chart(data2.head(5000)).mark_rule(color='red').encode(
    y='mean(totalyearlycompensation)',
)

rule2 = alt.Chart(data2.head(5000)).mark_rule(color='red').encode(
    y='mean(stockgrantvalue):Q'
)
rule3 = alt.Chart(data2.head(5000)).mark_rule(color='red').encode(
    y='mean(basesalary):Q'
)



In [72]:
chart_comp + rule1

In [73]:
chart_stock + rule2

In [74]:
chart_base + rule3
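
The charts above are built from `data2.head(5000)`, presumably to stay under Altair's default limit of 5,000 rows per chart. One alternative, sketched below with invented values, is to aggregate in pandas first so that every row contributes to the mean and only one small row per title is passed to the chart.

```python
import pandas as pd

# Hypothetical stand-in for `data2`; the real frame has 62,642 rows.
toy = pd.DataFrame({
    "title": ["Sales", "Sales", "Data Scientist", "Data Scientist"],
    "totalyearlycompensation": [150000, 170000, 200000, 220000],
    "stockgrantvalue": [10000.0, 20000.0, 40000.0, 50000.0],
    "basesalary": [110000, 120000, 135000, 145000],
})

# One row per title, holding the mean of each pay column over ALL rows.
title_means = toy.groupby("title", as_index=False).mean(numeric_only=True)

# Overall means across every row, usable as the position of a reference line.
overall_means = toy.drop(columns="title").mean()

print(title_means)
print(overall_means)
```

A frame like `title_means` has only as many rows as there are titles, so it could be passed to `alt.Chart(title_means).mark_bar().encode(x='title', y='basesalary')` without truncating the data.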

## Step 6: Group and aggregate the data

Perform some operations to group and aggregate the data. You might use a pivot table to review the minimum of two groups. Or you might use a pandas `groupby` operation to compute the mean of different groups. How you perform these operations is up to you! But your operations should be motivated by an interesting question that is partially illuminated by your data exploration.


In [75]:
# Average total yearly compensation for Data Scientists and Software Engineering Managers.
# In the graph above, Software Engineering Manager had the highest average yearly compensation.
ds_year_sal = np.mean(ds_df["totalyearlycompensation"])
highest_avg_year_sal = np.mean(se_df["totalyearlycompensation"])

# Average stock grant value for Data Scientists and Software Engineering Managers.
# In the graph above, Software Engineering Manager had the highest stock grant value.
ds_stock = np.mean(ds_df["stockgrantvalue"])
# Caution: this line reads totalyearlycompensation rather than stockgrantvalue, so
# highest_stock equals highest_avg_year_sal and the "stock grant" figure printed below
# is really total compensation. It should read se_df["stockgrantvalue"].
highest_stock = np.mean(se_df["totalyearlycompensation"])

# Average base salary for Data Scientists and Sales.
# From the graph above you can see that Data Scientists have the highest base salaries, with Sales second.
ds_base_sal = np.mean(ds_df["basesalary"])
sales_base_sal = np.mean(sales_df["basesalary"])

print('''The Average total yearly compensation 
for a software engineering manager is ${:,} and ${:,} for a data scientist'''.format(round(highest_avg_year_sal), round(ds_year_sal)))

print("This is a difference of ${:,} for average yearly compensation.\n".format(abs(round(highest_avg_year_sal-ds_year_sal))))

print('''The Average stock grant value for a software engineering manager
is ${:,} and ${:,} for a data scientist'''.format(round(highest_stock), round(ds_stock)))

print("This is a difference of ${:,} for average stock grant value.\n".format(abs(round(highest_stock-ds_stock))))

print('''The Average base salary for a sales job
is ${:,} and ${:,} for a data scientist'''.format(round(sales_base_sal), round(ds_base_sal)))

print("This is a difference of ${:,} for average base salary.\n".format(abs(round(sales_base_sal-ds_base_sal))))


The Average total yearly compensation 
for a software engineering manager is $354,636 and $203,657 for a data scientist
This is a difference of $150,979 for average yearly compensation.

The Average stock grant value for a software engineering manager
is $354,636 and $40,867 for a data scientist
This is a difference of $313,768 for average stock grant value.

The Average base salary for a sales job
is $118,471 and $138,055 for a data scientist
This is a difference of $19,584 for average base salary.
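
The stock grant figure printed for software engineering managers matches the total compensation figure exactly because `highest_stock` is computed from `totalyearlycompensation`. A `groupby` over all three pay columns avoids that kind of column mix-up, since each column is averaged under its own name. The sketch below uses invented values, so its numbers are not the dataset's.

```python
import pandas as pd

# Hypothetical stand-in for `data2`; values are made up for illustration.
toy = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist",
              "Software Engineering Manager", "Software Engineering Manager",
              "Sales", "Sales"],
    "totalyearlycompensation": [190000, 210000, 340000, 360000, 150000, 170000],
    "stockgrantvalue": [30000.0, 50000.0, 100000.0, 140000.0, 10000.0, 20000.0],
    "basesalary": [130000, 140000, 170000, 180000, 110000, 120000],
})

pay_cols = ["totalyearlycompensation", "stockgrantvalue", "basesalary"]

# Mean of each pay column for every title, in one table.
summary = toy.groupby("title")[pay_cols].mean()

# Difference of every title from the Data Scientist row, column by column.
diff_vs_ds = summary - summary.loc["Data Scientist"]

print(summary)
print(diff_vs_ds)
```

Running `data2.groupby("title")[pay_cols].mean()` on the real data would give the actual average stock grant value for each title.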



## Step 7: State questions and conclusions

At this point, you should have at least some sense of some properties of your data. What questions might your data help answer? What did you learn from the dataset that might answer these questions? Answer with a short paragraph below. Your explanation should link to external resources that validate your conclusions, or reflect them in a new light (e.g. news stories that describe anecdotal versions of what you have found from your analysis). 

# Conclusions
The biggest takeaway from this analysis came from the base salary calculations. My graph initially led me to believe that Sales had the highest base salary, but I knew this did not make sense. This is why I calculated the averages directly and compared them, and I found that the average Data Scientist base salary is $19,584 higher than the Sales average ($138,055 versus $118,471). From my research about base salary, I take this as an indication that these hard skills pay very well. Data science seems like a reliable profession if you just want to show up and do your job.

# Questions / Improvements

The first question I had came up while validating my dataset: how much of the [levels.fyi](https://www.levels.fyi/locations/) data did the Kaggle dataset include? Was it location specific? If so, which locations were included?

Another thing that bothered me is that I could not get my graphs to align with the actual averages in the dataset. For example, the average base salary for Sales is $118,471, but on the graph it approaches $160,000. A likely cause is that the charts are built from `data2.head(5000)`, which covers only the first 5,000 of the 62,642 rows, while the printed averages use every row.

Lastly, I could improve this analysis by also looking at location and years of experience to get a sense of the career trajectory of a data scientist.
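
As a starting point for that follow-up, the sketch below groups Data Scientist rows into experience bands and takes the median total compensation in each. The band edges and the `toy` values are arbitrary choices for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names; values are made up.
toy = pd.DataFrame({
    "title": ["Data Scientist"] * 5 + ["Sales"],
    "yearsofexperience": [0.0, 2.0, 4.0, 8.0, 12.0, 3.0],
    "totalyearlycompensation": [120000, 150000, 180000, 240000, 300000, 130000],
})

# Keep only Data Scientist rows; copy so adding a column does not touch `toy`.
ds = toy[toy["title"] == "Data Scientist"].copy()

# Left-closed bands: [0, 3), [3, 7), [7, 100). Edges are an arbitrary choice.
ds["experience_band"] = pd.cut(
    ds["yearsofexperience"],
    bins=[0, 3, 7, 100],
    labels=["under 3", "3 to 7", "7 plus"],
    right=False,
)

# Median is less sensitive than the mean to a few very large packages.
by_band = ds.groupby("experience_band", observed=True)["totalyearlycompensation"].median()
print(by_band)
```

The same pattern would work with `location` as the grouping column to compare regions.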