# Analyzing Tech Salaries Between Different Countries: Investigating Influential Factors

## Introduction
This research project explores how tech salaries vary between countries and which factors affect them, using the results of a 2016 Hacker News survey on salaries and bonuses.

The data was obtained from [Kaggle](https://www.kaggle.com/datasets/thedevastator/know-your-worth-tech-salaries-in-2016), with the original source published by Brandon Telle [[source]](https://data.world/brandon-telle/2016-hacker-news-salary-survey-results). The dataset describes salaries in the tech industry in 2016, including employer names, locations, job titles, experience levels, and compensation details.

The primary focus of this research is on analyzing salary disparities across countries and understanding the factors that may drive them. The research question guiding this investigation is: "How do salaries in the tech industry vary between different countries, and what factors contribute to these variations?" By analyzing this dataset, I aim to uncover patterns and correlations between salary and associated variables that can contribute to a more nuanced understanding of compensation trends within the tech sector.

Dependent variable (Y): Annual base pay is set to be the dependent variable. 

- *annual_base_pay*: The respondent's fixed annual salary, excluding bonuses and benefits. It serves as the dependent variable in my analysis because it captures the core earnings of each respondent and is a key metric for understanding the global tech salary landscape. Excluding bonuses and other variable components lets me focus on the fundamental salary structure and the factors behind salary disparities.

Independent variables (X): The following variables are set to be the independent variables: location (country), job category, employer name, job title, and total years of experience. Together they cover the main factors expected to influence annual base pay in the tech industry, directly addressing my research question on how salaries vary between countries and which factors contribute to these variations.
- *location_country*: Geographic location can significantly affect salary levels because of differences in living costs and in demand for tech professionals. Investigating how salaries differ between countries addresses the research question directly.
- *job_title_category*: The nature of the job, whether it's software, data, engineering, or management, can influence the salary level. Categorizing job roles helps examine how distinct fields in the tech industry contribute to salary variations. This variable is crucial for analysis as I seek to understand the impact of job categories on annual base pay.
- *employer_name*: The reputation and financial capabilities of the employer can influence salary levels. Larger, more established companies might offer higher salaries compared to smaller startups. Including employer_name as an independent variable allows me to explore the relationship between the reputation or size of the employer and the annual base pay received by the respondents. This adds a corporate dimension to my analysis, investigating how company reputation and stature affect compensation.
- *job_title*: The specific role held can directly impact salary levels, as different positions come with distinct responsibilities and skill requirements. Analyzing the impact of job titles on annual base pay helps uncover the hierarchical and positional factors contributing to salary variations in the tech sector. By considering job_title as an independent variable, I aim to compare salaries between equivalent positions across different countries, which will allow for a more detailed breakdown of salary differences.
- *total_experience_years*: The total number of years of professional experience the respondent has accumulated. Experience is expected to influence salary because individuals with more of it tend to command higher pay for their acquired skills and expertise. Examining experience levels can also show how the market values that expertise in different locations, giving a fuller picture of the dynamics behind compensation in the tech industry.
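
The country comparison described above can be sketched as a grouped summary. This is a minimal illustration on a small made-up frame, not the analysis carried out in this project: the column names match the dataset, but the rows, the `min_n` threshold, and the helper name `median_pay_by_country` are assumptions introduced for the example. The median is used because it is less sensitive to extreme values than the mean.

```python
import pandas as pd


def median_pay_by_country(frame, min_n=2):
    """Median base pay and respondent count per country (hypothetical helper)."""
    summary = (
        frame.groupby("location_country")["annual_base_pay"]
        .agg(median_pay="median", n="count")
    )
    # Keep only countries with at least `min_n` respondents,
    # so a single response does not stand in for a whole country.
    summary = summary[summary["n"] >= min_n]
    return summary.sort_values("median_pay", ascending=False)


# Made-up rows for illustration only; not taken from the survey.
toy = pd.DataFrame({
    "location_country": ["US", "US", "US", "DE", "DE", "IN"],
    "annual_base_pay": [120000.0, 100000.0, 90000.0, 70000.0, 60000.0, 20000.0],
})

by_country = median_pay_by_country(toy)
print(by_country)
```

The same call could be applied to the cleaned dataframe, with `min_n` raised to whatever sample size seems reasonable for a per-country comparison.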

## Data Cleaning

In [10]:
import numpy as np
import pandas as pd

In [46]:
#load in dataset
df = pd.read_csv("/Users/macychen/ECO225Project/Data/salaries_clean.csv", low_memory=False)

#display the dataframe (pandas truncates the preview to the first and last rows)
df

| | index | salary_id | employer_name | location_name | location_state | location_country | location_latitude | location_longitude | job_title | job_title_category | job_title_rank | total_experience_years | employer_experience_years | annual_base_pay | signing_bonus | annual_bonus | stock_value_bonus | comments | submitted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | opower | san francisco, ca | CA | US | 37.77 | -122.41 | systems engineer | Engineering | | 13.0 | 2.0 | 125000.0 | 5000.0 | 0.0 | 5000 shares | Don't work here. | 3/21/16 12:58 |
| 1 | 1 | 3 | walmart | bentonville, ar | AR | US | 36.36 | -94.20 | senior developer | Software | Senior | 15.0 | 8.0 | 65000.0 | | 5000.0 | 3000 | | 3/21/16 12:58 |
| 2 | 2 | 4 | vertical knowledge | cleveland, oh | OH | US | 41.47 | -81.67 | software engineer | Software | | 4.0 | 1.0 | 86000.0 | 5000.0 | 6000.0 | 0 | | 3/21/16 12:59 |
| 3 | 3 | 6 | netapp | waltham | MA | US | | | mts | Other | | 4.0 | 0.0 | 105000.0 | 5000.0 | 8500.0 | 0 | | 3/21/16 13:00 |
| 4 | 4 | 12 | apple | cupertino | CA | US | | | software engineer | Software | | 4.0 | 3.0 | 110000.0 | 5000.0 | 7000.0 | 150000 | | 3/21/16 13:02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1650 | 1650 | 3289 | sparkfun electronics | boulder, co | CO | US | 40.02 | -105.25 | junior software developer | Software | Junior | 1.0 | 0.5 | 60500.0 | 0.0 | 800.0 | 0 | | 3/23/16 8:24 |
| 1651 | 1651 | 3290 | intel | europe | | | | | staff software enginer | Software | | 6.0 | 4.0 | 164000.0 | 0.0 | 20000.0 | 30000 USD | | 3/23/16 8:27 |
| 1652 | 1652 | 3293 | $2bn valuation tech company | new york city | NY | US | | | sr. frontend eng | Web | Senior | 7.0 | 1.0 | 150000.0 | 0.0 | 0.0 | 0 | | 3/23/16 8:41 |
| 1653 | 1653 | 3294 | of maryland | college park, md | MD | US | 38.99 | -76.93 | scientific programmer (faculty research assist... | Applied Science | | 5.0 | 1.0 | 75000.0 | | | | | 3/23/16 8:43 |


In [47]:
#drop unneeded columns
new_df = df.drop(columns=["location_state", "job_title_rank", "employer_experience_years", "signing_bonus", "annual_bonus", "stock_value_bonus", "comments", "submitted_at"])

#display percentage of missing data (null values) in each column: number of NaNs / total number of rows
round(100*(new_df.isnull().sum()/len(new_df.index)), 2)

index                      0.00
salary_id                  0.00
employer_name              0.24
location_name              0.00
location_country           4.59
location_latitude         52.15
location_longitude        52.15
job_title                  0.00
job_title_category         0.00
total_experience_years     2.84
annual_base_pay            0.24
dtype: float64

In [48]:
#drop rows that have missing values for annual base pay, employer, total_experience_years and location_country
new_df.dropna(subset=['employer_name', 'annual_base_pay', 'total_experience_years', 'location_country'], inplace=True)

new_df.head()

| | index | salary_id | employer_name | location_name | location_country | location_latitude | location_longitude | job_title | job_title_category | total_experience_years | annual_base_pay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | opower | san francisco, ca | US | 37.77 | -122.41 | systems engineer | Engineering | 13.0 | 125000.0 |
| 1 | 1 | 3 | walmart | bentonville, ar | US | 36.36 | -94.2 | senior developer | Software | 15.0 | 65000.0 |
| 2 | 2 | 4 | vertical knowledge | cleveland, oh | US | 41.47 | -81.67 | software engineer | Software | 4.0 | 86000.0 |
| 3 | 3 | 6 | netapp | waltham | US | | | mts | Other | 4.0 | 105000.0 |
| 4 | 4 | 12 | apple | cupertino | US | | | software engineer | Software | 4.0 | 110000.0 |
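
To make the effect of the row-dropping step concrete, the sketch below applies `dropna` with a `subset` to a small made-up frame and counts the rows removed. The column names follow the dataset, but the rows and the variable names `toy`, `required`, and `cleaned` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names follow the dataset.
toy = pd.DataFrame({
    "employer_name": ["a", None, "c", "d"],
    "location_country": ["US", "US", None, "DE"],
    "location_latitude": [37.77, np.nan, np.nan, np.nan],
    "total_experience_years": [13.0, 4.0, 2.0, 6.0],
    "annual_base_pay": [125000.0, 86000.0, 50000.0, 70000.0],
})

required = ["employer_name", "annual_base_pay",
            "total_experience_years", "location_country"]

# Only missing values in the `required` columns cause a row to be dropped;
# a missing latitude on its own does not.
cleaned = toy.dropna(subset=required)
rows_dropped = len(toy) - len(cleaned)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

Because latitude and longitude are left out of `subset`, rows that lack coordinates are retained as long as the variables used in the analysis are present.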


## Summary Statistics Tables

In [41]:
new_df.describe()

| | index | salary_id | location_latitude | location_longitude | total_experience_years | annual_base_pay |
|---|---|---|---|---|---|---|
| count | 1533.0 | 1533.0 | 765.0 | 765.0 | 1533.0 | 1533.0 |
| mean | 824.020222 | 1678.39987 | 37.71319 | -64.128654 | 6.596758 | 240443.5 |
| std | 475.384112 | 927.914798 | 16.730984 | 67.634208 | 5.295233 | 4008983.0 |
| min | 0.0 | 1.0 | -41.0 | -123.27 | 0.0 | 0.0 |
| 25% | 414.0 | 898.0 | 37.41 | -104.81 | 3.0 | 61000.0 |
| 50% | 825.0 | 1709.0 | 38.67 | -95.0 | 5.0 | 99190.0 |
| 75% | 1232.0 | 2462.0 | 45.44 | -64.0 | 10.0 | 130000.0 |
| max | 1654.0 | 3298.0 | 65.0 | 174.0 | 40.0 | 156000000.0 |


In [44]:
new_df.describe(include=['object'])

| | employer_name | location_name | location_country | job_title | job_title_category |
|---|---|---|---|---|---|
| count | 1533 | 1533 | 1533 | 1533 | 1533 |
| unique | 991 | 568 | 65 | 616 | 8 |
| top | google | san francisco | US | software engineer | Software |
| freq | 60 | 143 | 1077 | 291 | 834 |


## Plots, Histograms, Figures