# Proposal

## Title: 

### Introduction

Data science and STEM careers are growing in popularity as more businesses leverage technological advances to solve problems and streamline operations. Students about to enter the job market and employees considering a career change are naturally curious about the main factors that shape the trajectory of salaries and bonuses at the five companies with the most records in our data. The purpose of this project is to identify the predictors in the data set that are most relevant to a person's career in STEM and data science, so that job seekers can be more informed and intentional about their development and career path. We will work with the "Data Science and STEM Salaries" dataset from Kaggle, which contains more than 62,000 salary records from leading organizations, covering numerous job titles and other attributes.

### Preliminary exploratory data analysis

The first step of the preliminary exploratory data analysis is to clean and wrangle the data into tidy format. Before writing any code for that process, we load the libraries required for the analysis.

In [2]:
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

The data for this project is the **Data Science and STEM Salaries** dataset from **Kaggle**, available at https://www.kaggle.com/datasets/jackogozaly/data-science-and-stem-salaries
- First, we download the file 'Levels_Fyi_Salary_Data.csv' from the link above. There we can see the columns available for wrangling, along with a brief explanation of each column. These explanations will matter later, when we drop the columns that are not useful for our analysis.
- Next, we upload the data to the 'data' folder of our Jupyter Notebook.
- After that, we open 'Levels_Fyi_Salary_Data.csv' to check that it is a comma-separated file, with no extra header lines or additional information that might hinder the reading process.
- Finally, we read the data using 'read_csv' and assign it to a variable named salary_data, as follows.

In [4]:
salary_data <- read_csv("data/Levels_Fyi_Salary_Data.csv")
salary_data

Rows: 62642 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): timestamp, company, level, title, location, tag, gender, otherdeta...
dbl (19): totalyearlycompensation, yearsofexperience, yearsatcompany, basesa...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


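| timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | tag | basesalary | ⋯ | Doctorate_Degree | Highschool | Some_College | Race_Asian | Race_White | Race_Two_Or_More | Race_Black | Race_Hispanic | Race | Education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 6/7/2017 11:33:27 | Oracle | L3 | Product Manager | 127000 | Redwood City, CA | 1.5 | 1.5 | NA | 107000 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |
| 6/10/2017 17:11:29 | eBay | SE 2 | Software Engineer | 100000 | San Francisco, CA | 5.0 | 3.0 | NA | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |
| 6/11/2017 14:53:57 | Amazon | L7 | Product Manager | 310000 | Seattle, WA | 8.0 | 0.0 | NA | 155000 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 9/13/2018 14:35:59 | MSFT | 63 | Software Engineer | 220000 | Seattle, WA | 14 | 12 | Full Stack | 157000 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |
| 9/16/2018 16:10:35 | Salesforce | Lead MTS | Software Engineer | 280000 | San Francisco, CA | 8 | 4 | iOS | 194688 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |
| 1/29/2019 5:12:59 | apple | ict3 | Software Engineer | 200000 | Sunnyvale, CA | 0 | 0 | ML / AI | 155000 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA |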


Looking at the data, we immediately notice that it is not yet in tidy format.
Before wrangling, we first count the records per company, role title, location, and tag.

In [5]:
top_5_company_count <- salary_data |>
    group_by(company) |>
    summarize(count = n()) |>
    arrange(desc(count)) |>
    head(5)
top_5_company_count

| company | count |
|---|---|
| `<chr>` | `<int>` |
| Amazon | 8126 |
| Microsoft | 5216 |
| Google | 4330 |
| Facebook | 2990 |
| Apple | 2028 |
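
Since the analysis focuses on the companies with the most records, one possible way to restrict the data to them is a semi-join against the count table. The sketch below is only an illustration: `toy_salary`, `top_companies`, and `toy_top` are made-up names, the rows are invented, and it keeps the top 2 companies rather than 5 so that the toy example stays small.

```r
library(tidyverse)

# Toy stand-in for salary_data; the rows are invented for illustration.
toy_salary <- tibble(
  company = c("A", "A", "A", "B", "B", "C", "D"),
  basesalary = c(100, 110, 120, 90, 95, 80, 70)
)

# Count records per company and keep the 2 most frequent
# (the real analysis would keep 5).
top_companies <- toy_salary |>
  count(company, name = "count", sort = TRUE) |>
  slice_head(n = 2)

# semi_join() keeps only the rows whose company appears in top_companies,
# without adding any columns from it.
toy_top <- toy_salary |>
  semi_join(top_companies, by = "company")

toy_top
```

On the real data, the same pattern would presumably use `salary_data` and `top_5_company_count` in place of the toy objects.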


In [6]:
all_role_title <- salary_data |> group_by(title) |> summarize(count=n())
all_role_title

| title | count |
|---|---|
| `<chr>` | `<int>` |
| Business Analyst | 885 |
| Data Scientist | 2578 |
| Hardware Engineer | 2200 |
| ⋮ | ⋮ |
| Software Engineering Manager | 3569 |
| Solution Architect | 1157 |
| Technical Program Manager | 1381 |


In [7]:
all_location <- salary_data |> group_by(location) |> summarize(count=n())
all_location

| location | count |
|---|---|
| `<chr>` | `<int>` |
| Aachen, NW, Germany | 3 |
| Aarhus, AR, Denmark | 5 |
| Aberdeen Proving Ground, MD | 1 |
| ⋮ | ⋮ |
| Zaragoza, AR, Spain | 3 |
| Zug, ZG, Switzerland | 1 |
| Zurich, ZH, Switzerland | 172 |


In [8]:
all_tags <- salary_data |> group_by(tag) |> summarize(count=n())
all_tags

| tag | count |
|---|---|
| `<chr>` | `<int>` |
| -- | 5 |
| ?? | 2 |
| .NET | 1 |
| ⋮ | ⋮ |
| YouTube | 1 |
| Z Systems | 1 |
| NA | 808 |


### Method

We will use years of experience, gender, base salary, and bonus as predictors of the categorical variable company. Years of experience, base salary, and bonus are quantitative, while gender is categorical and will need to be encoded numerically before it can be used alongside them. First, we will filter out all rows containing NA or zero values, then select the four predictors and the response column. The data is then split into training and testing sets. The training set is used to build the recipe and fit the model, while the testing set is used to predict outcomes and measure accuracy. We will also visualize the prediction results and plot accuracy against k for each combination of predictor variables, so that we can compare how well the different combinations predict.
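
The steps in this section can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the final analysis: it uses an invented data frame (`toy`) instead of `salary_data`, it assumes a K-nearest neighbors classifier with the `kknn` engine (which requires the `kknn` package to be installed), it fixes `neighbors = 5` arbitrarily instead of tuning k, and it leaves out gender because a categorical predictor would first need to be encoded.

```r
library(tidymodels)

set.seed(1)  # make the random data and the split reproducible

# Toy stand-in for the salary data; all values are invented.
toy <- tibble(
  company = factor(rep(c("Amazon", "Google"), each = 20)),
  yearsofexperience = c(runif(20, 0, 8), runif(20, 4, 15)),
  basesalary = c(rnorm(20, 130000, 10000), rnorm(20, 160000, 10000)),
  bonus = c(rnorm(20, 10000, 3000), rnorm(20, 30000, 3000))
)

# Drop rows with missing or non-positive values, then keep the
# response (company) and the predictors.
toy_clean <- toy |>
  drop_na() |>
  filter(basesalary > 0, bonus > 0) |>
  select(company, yearsofexperience, basesalary, bonus)

# Split into training and testing sets, stratified by the response.
toy_split <- initial_split(toy_clean, prop = 0.75, strata = company)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Recipe: standardize the predictors so no variable dominates the distance.
knn_recipe <- recipe(company ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification: K-nearest neighbors classification (assumed model).
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training set only.
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the testing set and compute accuracy.
test_accuracy <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test) |>
  accuracy(truth = company, estimate = .pred_class)

test_accuracy
```

To compare accuracy across values of k, the fixed `neighbors = 5` would be replaced by `neighbors = tune()` and evaluated with cross-validation on the training set.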



### Expected outcomes and significance
- What do you expect to find?
In this project, we expect to find the predictors that are most relevant to the trajectory of compensation.
These can range from years of experience to role title.
- What impact could such findings have?
The findings could matter to the job market and to the career paths of people who are currently searching for jobs or who want to learn more about their career development. The predictors we identify are factors they can weigh when researching opportunities.
- What future questions could this lead to?
Our dataset is incomplete: some variables have a high percentage of N/A values, so it would be helpful to refresh the survey in the near future to gather more information. This would give us a clearer picture and strengthen our understanding of the relationships between the variables.
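
To see which variables have a high percentage of N/A values, one option is to compute the share of missing values per column. The sketch below uses an invented three-column tibble (`toy`) in place of `salary_data`; the name `na_share` is also just illustrative.

```r
library(tidyverse)

# Toy stand-in for salary_data; values are invented.
toy <- tibble(
  gender = c("Male", NA, "Female", NA),
  basesalary = c(100000, 120000, NA, 90000),
  company = c("A", "B", "C", "D")
)

# For every column, compute the proportion of NA entries,
# then reshape to one row per column and sort by that proportion.
na_share <- toy |>
  summarize(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "prop_na") |>
  arrange(desc(prop_na))

na_share
```

Note that this only counts explicit NA values; entries such as a base salary of 0 would need a separate check.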

Can we predict role title based on a person's experience?
For example: a person has 5 years of work experience. Is this person an engineer?

What if we predict compensation based on:
- yearsofexperience
- location
- role title
- yearsatcompany
- basesalary
- 

The point is, the only code we have to include is the code for cleaning and wrangling.

RQ: Which predictor is the most impactful for predicting yearly compensation?
1. Filter out the columns with no data or no way to quantify them, for example: gender, observation class, etc.
- Comment from a pessimist: why do we not just filter out the NA rows instead of excluding the column?
2. Use all the remaining predictors to do a forward selection (as in 6.8.3)
3. Finally, make and tune a model to classify yearly compensation (?)
- At this point, since we know which predictors are relevant and which are not, we can just predict role title, and the question at the beginning will be nullified
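
The forward selection in step 2 could look roughly like the sketch below. It is a simplified stand-in, not the procedure from the referenced section: it treats yearly compensation as a numeric response, swaps in a plain linear model scored by 5-fold cross-validated RMSE, and runs on invented data in which `noise_var` is deliberately unrelated to compensation. The helper `cv_rmse` and the objects `toy`, `folds`, `selected`, and `path` are hypothetical names.

```r
set.seed(2)

# Toy data: compensation depends on experience and base salary;
# noise_var is pure noise. All values are invented.
n <- 200
toy <- data.frame(
  yearsofexperience = runif(n, 0, 20),
  basesalary = rnorm(n, 150, 20),
  noise_var = rnorm(n)
)
toy$totalyearlycompensation <- 8 * toy$yearsofexperience +
  1.2 * toy$basesalary + rnorm(n, sd = 10)

# Cross-validated RMSE of a linear model using the given predictors.
cv_rmse <- function(predictors, data, response, folds) {
  f <- reformulate(predictors, response = response)
  errs <- sapply(sort(unique(folds)), function(k) {
    fit <- lm(f, data = data[folds != k, ])
    pred <- predict(fit, newdata = data[folds == k, ])
    sqrt(mean((data[[response]][folds == k] - pred)^2))
  })
  mean(errs)
}

# Assign each row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = n))

candidates <- c("yearsofexperience", "basesalary", "noise_var")
selected <- character(0)
path <- data.frame(size = integer(0), model = character(0), rmse = numeric(0))

# At each step, add the candidate that gives the lowest cross-validated RMSE.
while (length(candidates) > 0) {
  scores <- sapply(candidates, function(p) {
    cv_rmse(c(selected, p), toy, "totalyearlycompensation", folds)
  })
  best <- names(scores)[which.min(scores)]
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
  path <- rbind(path, data.frame(
    size = length(selected),
    model = paste(selected, collapse = " + "),
    rmse = min(scores)
  ))
}

path
```

Reading `path` from top to bottom shows the order in which predictors were added and where the error stops improving, which is one way to judge which predictors are worth keeping.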