# Study of Tech Company Salaries

![techCompanies](tech_logos.png) 

This notebook takes a look into a dataset populated with information on tech employees working at various companies.  Our mission is to find and measure variables that may have an impact on the salary of an individual.

In [442]:
# Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

In [483]:
# Load the dataset
df = pd.read_csv('tech_companies_salary.csv')
# Display the first few rows of the dataframe
print(df.head())
# Display summary information about the dataframe
df.info

            timestamp    company level                         title  \
0    06-07-2017 11:33     Oracle    L3               Product Manager   
1    06-10-2017 17:11       eBay  SE 2             Software Engineer   
2    06-11-2017 14:53     Amazon    L7               Product Manager   
3   6/17/2017 0:23:14      Apple    M1  Software Engineering Manager   
4  6/20/2017 10:58:51  Microsoft    60             Software Engineer   

   totalyearlycompensation           location  yearsofexperience  \
0                   127000   Redwood City, CA                1.5   
1                   100000  San Francisco, CA                5.0   
2                   310000        Seattle, WA                8.0   
3                   372000      Sunnyvale, CA                7.0   
4                   157000  Mountain View, CA                5.0   

   yearsatcompany  tag  basesalary  ...  Doctorate_Degree  Highschool  \
0             1.5  NaN      107000  ...                 0           0   
1           

<bound method DataFrame.info of                 timestamp     company     level                         title  \
0        06-07-2017 11:33      Oracle        L3               Product Manager   
1        06-10-2017 17:11        eBay      SE 2             Software Engineer   
2        06-11-2017 14:53      Amazon        L7               Product Manager   
3       6/17/2017 0:23:14       Apple        M1  Software Engineering Manager   
4      6/20/2017 10:58:51   Microsoft        60             Software Engineer   
...                   ...         ...       ...                           ...   
62637    09-09-2018 11:52      Google        T4             Software Engineer   
62638   9/13/2018 8:23:32   Microsoft        62             Software Engineer   
62639  9/13/2018 14:35:59        MSFT        63             Software Engineer   
62640  9/16/2018 16:10:35  Salesforce  Lead MTS             Software Engineer   
62641   1/29/2019 5:12:59       apple      ict3             Software Engineer

In [484]:
# Number of null values in each column
df.isna().sum()

timestamp                      0
company                        5
level                        123
title                          0
totalyearlycompensation        0
location                       0
yearsofexperience              0
yearsatcompany                 0
tag                          870
basesalary                     0
stockgrantvalue                0
bonus                          0
gender                     19540
otherdetails               22508
cityid                         0
dmaid                          2
rowNumber                      0
Masters_Degree                 0
Bachelors_Degree               0
Doctorate_Degree               0
Highschool                     0
Some_College                   0
Race_Asian                     0
Race_White                     0
Race_Two_Or_More               0
Race_Black                     0
Race_Hispanic                  0
Race                       40215
Education                  32272
dtype: int64

# Drop data with null values

Dropping the data of the gender, education, company, race, tag, and level columns, cleans up the data in order to look at the relationship of these variables on salaries.  These values cannot be replaced without potentially skewing the data.

In [485]:
# Remove rows with null values in subset columns
df = df.dropna(subset=['gender', 'Education', 'company', 'Race', 'tag', 'level'])
# Verify removal
print(df.isna().sum())

timestamp                  0
company                    0
level                      0
title                      0
totalyearlycompensation    0
location                   0
yearsofexperience          0
yearsatcompany             0
tag                        0
basesalary                 0
stockgrantvalue            0
bonus                      0
gender                     0
otherdetails               0
cityid                     0
dmaid                      0
rowNumber                  0
Masters_Degree             0
Bachelors_Degree           0
Doctorate_Degree           0
Highschool                 0
Some_College               0
Race_Asian                 0
Race_White                 0
Race_Two_Or_More           0
Race_Black                 0
Race_Hispanic              0
Race                       0
Education                  0
dtype: int64


## Updated Summary Statistics

In [486]:
df.size

623935

In [487]:
df.describe()

Unnamed: 0,totalyearlycompensation,yearsofexperience,yearsatcompany,basesalary,stockgrantvalue,bonus,cityid,dmaid,rowNumber,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic
count,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0,21515.0
mean,197950.8,7.127167,2.706995,133894.538694,44974.392424,18401.301418,10177.354218,561.242436,59207.188891,0.421055,0.506577,0.042854,0.013944,0.01571,0.528887,0.355612,0.035417,0.030769,0.049361
std,133131.3,5.848876,3.328438,57231.645682,72523.732291,24802.564405,7677.552448,315.74473,14561.633489,0.49374,0.499968,0.202532,0.11726,0.124354,0.499176,0.47871,0.184836,0.172696,0.216625
min,10000.0,0.0,0.0,4000.0,0.0,0.0,10.0,0.0,21208.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,119000.0,3.0,0.0,100000.0,0.0,3000.0,7322.0,501.0,47070.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,174000.0,6.0,2.0,135000.0,20000.0,13000.0,8198.0,751.0,59849.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,245000.0,10.0,4.0,165000.0,55000.0,25000.0,11521.0,807.0,71599.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
max,4980000.0,45.0,40.0,900000.0,954000.0,900000.0,47926.0,881.0,83875.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Subsetting dataframe to remove redundant columns and display the first five rows

In [488]:
df = df[['totalyearlycompensation', 'basesalary', 'bonus', 'company', 'cityid', 'location', 'level', 'title', 'tag', 'yearsofexperience',
        'yearsatcompany', 'stockgrantvalue', 'dmaid', 'rowNumber', 'Race', 'Education', 'gender']]
df.head()

Unnamed: 0,totalyearlycompensation,basesalary,bonus,company,cityid,location,level,title,tag,yearsofexperience,yearsatcompany,stockgrantvalue,dmaid,rowNumber,Race,Education,gender
15710,400000,210000,45000.0,Google,7472,"Sunnyvale, CA",L6,Software Engineer,Distributed Systems (Back-End),5.0,5.0,145000.0,807.0,21208,Asian,PhD,Male
23532,136000,124000,11000.0,Microsoft,11521,"Redmond, WA",61,Software Engineer,DevOps,3.0,2.0,1000.0,819.0,32237,Two Or More,Bachelor's Degree,Male
23533,337000,177000,36000.0,Google,7413,"San Bruno, CA",L5,Software Engineer,Full Stack,6.0,6.0,125000.0,807.0,32239,Asian,Bachelor's Degree,Male
23534,222000,164000,20000.0,Microsoft,11527,"Seattle, WA",62,Software Engineer,API Development (Back-End),4.0,4.0,38000.0,819.0,32240,Asian,Master's Degree,Male
23535,187000,165000,0.0,Blend,7419,"San Francisco, CA",IC3,Software Engineer,Full Stack,5.0,0.0,22000.0,807.0,32241,White,Bachelor's Degree,Male
23537,310000,160000,0.0,Amazon,11527,"Seattle, WA",L6,Software Engineer,ML / AI,15.0,3.0,150000.0,819.0,32243,Asian,Bachelor's Degree,Male
23538,113000,103000,10000.0,Chevron,11109,"Houston, TX",PSG 20,Software Engineer,DevOps,3.0,3.0,0.0,618.0,32244,Hispanic,Bachelor's Degree,Male
23540,620000,160000,0.0,Amazon,11527,"Seattle, WA",L7,Software Engineering Manager,Full Stack,19.0,7.0,460000.0,819.0,32247,Asian,Bachelor's Degree,Male
23541,98000,78000,0.0,Shopify,1206,"Toronto, ON, Canada",L6,Software Engineer,Web Development (Front-End),9.0,4.0,20000.0,0.0,32248,Asian,Bachelor's Degree,Male
23543,180000,130000,20000.0,Apple,1320,"Vancouver, BC, Canada",ICT3,Software Engineer,ML / AI,1.0,1.0,30000.0,0.0,32250,Asian,Bachelor's Degree,Male


## Method to split the location column into city and state columns.

Locations with more than 2 strings delimited by a comma will be set to null values.  These locations correspond to locations outside of the United States.

Limiting our scope of the data to within the U.S. will increase the accuracy and usefulness of our model, as well as eliminate some outliers in the data.

In [489]:
def split_location(loc):
    location = loc.split(', ')  
    if len(location) == 2:
        city = location[0]
        state = location[1]
    else:
        city = None
        state = None
    return pd.Series([city, state])

df[['City', 'State']] = df['location'].apply(split_location)

Unnamed: 0,totalyearlycompensation,basesalary,bonus,company,cityid,location,level,title,tag,yearsofexperience,yearsatcompany,stockgrantvalue,dmaid,rowNumber,Race,Education,gender,City,State
15710,400000,210000,45000.0,Google,7472,"Sunnyvale, CA",L6,Software Engineer,Distributed Systems (Back-End),5.0,5.0,145000.0,807.0,21208,Asian,PhD,Male,Sunnyvale,CA
23532,136000,124000,11000.0,Microsoft,11521,"Redmond, WA",61,Software Engineer,DevOps,3.0,2.0,1000.0,819.0,32237,Two Or More,Bachelor's Degree,Male,Redmond,WA
23533,337000,177000,36000.0,Google,7413,"San Bruno, CA",L5,Software Engineer,Full Stack,6.0,6.0,125000.0,807.0,32239,Asian,Bachelor's Degree,Male,San Bruno,CA
23534,222000,164000,20000.0,Microsoft,11527,"Seattle, WA",62,Software Engineer,API Development (Back-End),4.0,4.0,38000.0,819.0,32240,Asian,Master's Degree,Male,Seattle,WA
23535,187000,165000,0.0,Blend,7419,"San Francisco, CA",IC3,Software Engineer,Full Stack,5.0,0.0,22000.0,807.0,32241,White,Bachelor's Degree,Male,San Francisco,CA


### Restructuring of the data frame for readability with added state and city columns

In [490]:
df = df[['totalyearlycompensation', 'basesalary', 'bonus', 'company', 'cityid', 'location', 'State', 'City', 'level', 'title', 'tag', 'yearsofexperience',
        'yearsatcompany', 'stockgrantvalue', 'dmaid', 'rowNumber', 'Race', 'Education', 'gender']]
df.head()

Unnamed: 0,totalyearlycompensation,basesalary,bonus,company,cityid,location,State,City,level,title,tag,yearsofexperience,yearsatcompany,stockgrantvalue,dmaid,rowNumber,Race,Education,gender
15710,400000,210000,45000.0,Google,7472,"Sunnyvale, CA",CA,Sunnyvale,L6,Software Engineer,Distributed Systems (Back-End),5.0,5.0,145000.0,807.0,21208,Asian,PhD,Male
23532,136000,124000,11000.0,Microsoft,11521,"Redmond, WA",WA,Redmond,61,Software Engineer,DevOps,3.0,2.0,1000.0,819.0,32237,Two Or More,Bachelor's Degree,Male
23533,337000,177000,36000.0,Google,7413,"San Bruno, CA",CA,San Bruno,L5,Software Engineer,Full Stack,6.0,6.0,125000.0,807.0,32239,Asian,Bachelor's Degree,Male
23534,222000,164000,20000.0,Microsoft,11527,"Seattle, WA",WA,Seattle,62,Software Engineer,API Development (Back-End),4.0,4.0,38000.0,819.0,32240,Asian,Master's Degree,Male
23535,187000,165000,0.0,Blend,7419,"San Francisco, CA",CA,San Francisco,IC3,Software Engineer,Full Stack,5.0,0.0,22000.0,807.0,32241,White,Bachelor's Degree,Male


### Verifying results of creating new columns and dropping data with null values in state column.

Dropping these results further eliminates international entries.

In [491]:
df.State.unique()

array(['CA', 'WA', 'TX', None, 'AZ', 'NY', 'IL', 'MA', 'NC', 'VA', 'CO',
       'FL', 'WI', 'DC', 'OR', 'MI', 'MN', 'MO', 'PA', 'UT', 'NJ', 'TN',
       'LA', 'GA', 'AR', 'IA', 'OH', 'MD', 'IN', 'MT', 'RI', 'DE', 'OK',
       'NV', 'CT', 'AL', 'SC', 'KS', 'KY', 'WV', 'MS', 'ID', 'NE', 'HI',
       'NH', 'NM', 'ND', 'VT', 'WY', 'ME'], dtype=object)

In [492]:
# Remove rows with null values in State column
df = df.dropna(subset='State')
# Verify removal
print(df.isna().sum())

totalyearlycompensation    0
basesalary                 0
bonus                      0
company                    0
cityid                     0
location                   0
State                      0
City                       0
level                      0
title                      0
tag                        0
yearsofexperience          0
yearsatcompany             0
stockgrantvalue            0
dmaid                      0
rowNumber                  0
Race                       0
Education                  0
gender                     0
dtype: int64


### Verification that columns contain unique and non-redundant values

In [495]:
df.Education.unique()

array(['PhD', "Bachelor's Degree", "Master's Degree", 'Some College',
       'Highschool'], dtype=object)

In [496]:
df.Race.unique()

array(['Asian', 'Two Or More', 'White', 'Hispanic', 'Black'], dtype=object)

In [497]:
df.title.unique()

array(['Software Engineer', 'Software Engineering Manager',
       'Hardware Engineer', 'Product Designer', 'Management Consultant',
       'Product Manager', 'Solution Architect', 'Sales',
       'Technical Program Manager', 'Data Scientist', 'Recruiter',
       'Mechanical Engineer', 'Business Analyst', 'Human Resources',
       'Marketing'], dtype=object)

### Formatting of the company category eliminates redundant values

In [498]:
df['company'] = df['company'].apply(lambda company: company.upper())

companies = df.company.unique()
companies_edit = []
for company in companies:
    companies_edit.append(company.upper())
companies_edit.sort()
for company in companies_edit:
    print(company)

### Exporting dataframe for external use and backup purposes

In [500]:
df.to_csv('updated_base_df.csv')

### Importing CSV loaded with stock market information to merge with original dataframe.

In [461]:
stocks_df = pd.read_csv('screener-stocks.csv')
stocks_df.head()

Unnamed: 0,Symbol,Company Name,Market Cap,Stock Price,% Change,Industry,PE Ratio,Ent. Value,MC Group,Sector,...,Oper. Margin,Pretax Margin,Profit Margin,R&D / Rev,Avg. Volume,Rel. Volume,RSI,Tax / Revenue,Rev Gr. This Q,Rev Gr. Next Q
0,FLWS,1-800-FLOWERS.COM,696569800.0,10.92,1.11%,Specialty Retail,,702474800.0,Small-Cap,Consumer Discretionary,...,-2.89%,-3.39%,-3.24%,3.20%,420598.0,77.96%,59.81,-0.15%,-5.95%,-3.19%
1,VCXB,10X Capital Venture Acquisition Corp. III,142102600.0,10.77,,Shell Companies,63.35,142077300.0,Micro-Cap,Financials,...,,,,,10038.0,0.12%,60.96,,,
2,TXG,10x Genomics,4316825000.0,36.89,1.77%,Health Information Services,,4023500000.0,Mid-Cap,Healthcare,...,-42.88%,-40.21%,-41.23%,43.69%,1403080.0,74.62%,34.24,1.02%,7.41%,8.17%
3,YI,"111, Inc.",87352930.0,1.01,-2.88%,Pharmaceutical Retailers,,38840450.0,Micro-Cap,Healthcare,...,-2.30%,-2.58%,-2.63%,0.82%,81508.0,102.12%,27.8,0.00%,,
4,YQ,17 Education & Technology Group,30658290.0,2.97,,Education & Training Services,,-32903210.0,Nano-Cap,Consumer Staples,...,-196.68%,-178.90%,-182.37%,96.36%,18790.0,16.56%,54.54,,,


### Creating a company column with reformatted company names creates a column that can merge the datasets.

Removal of the reference to ticker GOOGL, in order to merge data corresponding to ticker GOOG.

In [462]:
stocks_df['company'] = stocks_df['Company Name'].apply(lambda company: company.upper())
stocks_df.drop(stocks_df.loc[stocks_df['Symbol']=='GOOGL'].index, inplace=True)

### Reassignment of company names in original dataset to corresponding company names in stock dataset.

In [464]:
df.loc[ df['company'] == 'GOOGLE', 'company'] = 'ALPHABET'
df.loc[ df['company'] == 'AMAZON', 'company'] = 'AMAZON.COM'
df.loc[ df['company'] == 'FACEBOOK', 'company'] = 'META PLATFORMS'
df.loc[ df['company'] == 'PAYPAL', 'company'] = 'PAYPAL HOLDINGS'
df.loc[ df['company'] == 'HSBC', 'company'] = 'HSBC HOLDINGS'
df.loc[ df['company'] == 'APPLE INC.', 'company'] = 'APPLE'
df.loc[ df['company'] == '2U', 'company'] = '2U, INC.'
df.loc[ df['company'] == 'ABBOTT', 'company'] = 'ABBOTT LABORATORIES'
df.loc[ df['company'] == '8X8', 'company'] = '8X8, INC.'
df.loc[ df['company'] == 'ADP', 'company'] = 'Automatic Data Processing'.upper()
df.loc[ df['company'] == 'AFFIRM', 'company'] = 'AFFIRM HOLDINGS'
df.loc[ df['company'] == 'AMD', 'company'] = 'ADVANCED MICRO DEVICES'
df.loc[ df['company'] == 'INTEL CORPORATION', 'company'] = 'INTEL'
df.loc[ df['company'] == 'IBM', 'company'] = 'INTERNATIONAL BUSINESS MACHINES'
df.loc[ df['company'] == 'CAPITAL ONE', 'company'] = 'CAPITAL ONE FINANCIAL'
df.loc[ df['company'] == 'CISCO', 'company'] = 'CISCO SYSTEMS'
df.loc[ df['company'] == 'JPMORGAN CHASE', 'company'] = 'JPMORGAN CHASE & CO.'
df.loc[ df['company'] == 'JP MORGAN CHASE', 'company'] = 'JPMORGAN CHASE & CO.'
df.loc[ df['company'] == 'JP MORGAN', 'company'] = 'JPMORGAN CHASE & CO.'
df.loc[ df['company'] == 'UBER', 'company'] = 'UBER TECHNOLOGIES'
df.loc[ df['company'] == 'GOLDMAN SACHS', 'company'] = 'THE GOLDMAN SACHS GROUP'
df.loc[ df['company'] == 'WALMART LABS', 'company'] = 'WALMART'
df.loc[ df['company'] == 'EBAY', 'company'] = 'EBAY INC.'
df.loc[ df['company'] == 'VISA', 'company'] = 'VISA INC.'
df.loc[ df['company'] == 'LYFT', 'company'] = 'LYFT, INC.'
df.loc[ df['company'] == 'SAP', 'company'] = 'SAP SE'
df.loc[ df['company'] == 'DELL', 'company'] = 'DELL TECHNOLOGIES'
df.loc[ df['company'] == 'BOEING', 'company'] = 'THE BOEING COMPANY'
df.loc[ df['company'] == 'GENERAL MOTORS', 'company'] = 'GENERAL MOTORS COMPANY'
df.loc[ df['company'] == 'T-MOBILE', 'company'] = 'T-MOBILE US'

### Initial merge of the data sets

In [465]:
agg_df = pd.merge(df, stocks_df, on="company", how="left")
print(agg_df['Symbol'].isna().sum())

5093


In [502]:
counts = agg_df['company'].value_counts()
print(counts.head(10))

company
AMAZON.COM                         2090
MICROSOFT                          1299
ALPHABET                           1104
META PLATFORMS                      867
APPLE                               569
ORACLE                              309
INTEL                               303
INTERNATIONAL BUSINESS MACHINES     277
CAPITAL ONE FINANCIAL               267
CISCO SYSTEMS                       262
Name: count, dtype: int64


### Eliminating Remaining Null Values

To eliminate the remaining null values of the symbol column, the entries were researched and a symbol was manually entered into the dataset. Private companies were assigned "Private", while ambiguous and government companies were left null.  The latter to be dropped.

This dataset was then merged on the "Symbol" column to fill in missing values and the dataset was saved as final_aggregated_df.csv.

Column names were updated in order to follow a uniform naming convention.

### Import of aggregated dataset

In [560]:
df = pd.read_csv('final_aggregated_df.csv')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,total_yearly_compensation,base_salary,bonus,company,city_id,location,state,city,...,pretax_margin,profit_margin,research_development_over_revenue,average_volume,relative_volume,rsi,tax_over_revenue,revenue_growth_this_quarter,revenue_growth_next_quarter,gender
0,5928,0,400000,210000,45000,ALPHABET,7472,"Sunnyvale, CA",CA,Sunnyvale,...,27.89%,24.01%,14.78%,22956359.0,72.27%,64.81,3.88%,16.00%,14.35%,Male
1,10063,1,136000,124000,11000,MICROSOFT,11521,"Redmond, WA",WA,Redmond,...,44.47%,36.27%,12.09%,22842031.0,72.63%,56.25,8.20%,17.27%,17.13%,Male
2,5929,2,337000,177000,36000,ALPHABET,7413,"San Bruno, CA",CA,San Bruno,...,27.89%,24.01%,14.78%,22956359.0,72.27%,64.81,3.88%,16.00%,14.35%,Male
3,10064,3,222000,164000,20000,MICROSOFT,11527,"Seattle, WA",WA,Seattle,...,44.47%,36.27%,12.09%,22842031.0,72.63%,56.25,8.20%,17.27%,17.13%,Male
4,3787,4,187000,165000,0,BLEND,7419,"San Francisco, CA",CA,San Francisco,...,-113.87%,-118.15%,52.02%,1863929.0,43.24%,56.52,0.06%,-5.65%,-0.47%,Male


In [561]:
df.shape

(16903, 92)

### Columns of the new dataset

In [562]:
for each in df.columns:
    print(each)

Unnamed: 0.1
Unnamed: 0
total_yearly_compensation
base_salary
bonus
company
city_id
location
state
city
level
title
tag
years_of_experience
years_at_company
stock_grant_value
dmaid
row_number
race
education
symbol
company_name
market_cap
stock_price
% Change
industry
pe_ratio
Ent. Value
mc_group
sector
forward_pe
country
state_of_hq
employees
employee_change
employee_growth
revenue
rev_growth_3y
rev_growth_5y
rev_growth_yrs
gross_profit
gross_profit_growth
gross_profit_growth_quarter
gross_profit_growth_5Y
gross_profit_growth_3y
net_income
operating_income
operating_income_growth
operating_income_growth_quarter
operating_income_growth_3y
operating_income_growth_5y
net_income_growth
net_income_growth_quarter
net_income_growth_3y
net_income_growth_5y
net_income_growth_years
net_income_growth_quarters
total_cash
total_debt
revenue_per_employee
profit_per_employee
working_capital
equity
net_working_capital
revenue_growth_this_year
revenue_growth_next_5y
revenue_growth_next_year
revenue_gro

# Subsetting and Cleaning Workable Dataset

## Drop entries where 'symbol' is null
This removes ambiguous and government owned companies

In [563]:
print(df['symbol'].isna().sum())

126


In [564]:
df = df.dropna(subset='symbol')
print(df['symbol'].isna().sum())

0


## Subset data into International, Private, and Public Companies

In [565]:
international_companies = df.loc[df['symbol'] == 'INTERNATIONAL']
international_companies.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,total_yearly_compensation,base_salary,bonus,company,city_id,location,state,city,...,pretax_margin,profit_margin,research_development_over_revenue,average_volume,relative_volume,rsi,tax_over_revenue,revenue_growth_this_quarter,revenue_growth_next_quarter,gender
51,8099,51,107000,107000,5000,FUJITSU,7472,"Sunnyvale, CA",CA,Sunnyvale,...,,,,,,,,,,Male
150,8138,150,325000,150000,30000,INDEED,10965,"Austin, TX",TX,Austin,...,,,,,,,,,,Male
239,8068,239,40000,40000,1000,CAPGEMINI,18625,"Italy, TX",TX,Italy,...,,,,,,,,,,Male
243,8319,243,105000,100000,5000,SWISS RE,10182,"New York, NY",NY,New York,...,,,,,,,,,,Male
277,8257,277,73000,71000,2000,TATA CONSULTANCY SERVICES,11039,"Dallas, TX",TX,Dallas,...,,,,,,,,,,Male


In [566]:
international_companies.shape

(288, 92)

In [567]:
private_companies = df.loc[df['symbol'] == 'PRIVATE']
private_companies.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,total_yearly_compensation,base_salary,bonus,company,city_id,location,state,city,...,pretax_margin,profit_margin,research_development_over_revenue,average_volume,relative_volume,rsi,tax_over_revenue,revenue_growth_this_quarter,revenue_growth_next_quarter,gender
14,14082,14,365000,187000,38000,TWITTER,10182,"New York, NY",NY,New York,...,,,,,,,,,,Male
20,14446,20,300000,160000,40000,TANIUM,8198,"Chicago, IL",IL,Chicago,...,,,,,,,,,,Male
22,12717,22,493000,215000,54000,BYTEDANCE,7322,"Mountain View, CA",CA,Mountain View,...,,,,,,,,,,Female
23,14475,23,190000,180000,10000,TICKETMASTER,7275,"Los Angeles, CA",CA,Los Angeles,...,,,,,,,,,,Male
29,13501,29,266000,246000,20000,KPMG,10965,"Austin, TX",TX,Austin,...,,,,,,,,,,Male


In [568]:
private_companies.shape

(2281, 92)

In [569]:
public_companies = df.drop(df[df['symbol'] == 'PRIVATE'].index)
public_companies.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,total_yearly_compensation,base_salary,bonus,company,city_id,location,state,city,...,pretax_margin,profit_margin,research_development_over_revenue,average_volume,relative_volume,rsi,tax_over_revenue,revenue_growth_this_quarter,revenue_growth_next_quarter,gender
0,5928,0,400000,210000,45000,ALPHABET,7472,"Sunnyvale, CA",CA,Sunnyvale,...,27.89%,24.01%,14.78%,22956359.0,72.27%,64.81,3.88%,16.00%,14.35%,Male
1,10063,1,136000,124000,11000,MICROSOFT,11521,"Redmond, WA",WA,Redmond,...,44.47%,36.27%,12.09%,22842031.0,72.63%,56.25,8.20%,17.27%,17.13%,Male
2,5929,2,337000,177000,36000,ALPHABET,7413,"San Bruno, CA",CA,San Bruno,...,27.89%,24.01%,14.78%,22956359.0,72.27%,64.81,3.88%,16.00%,14.35%,Male
3,10064,3,222000,164000,20000,MICROSOFT,11527,"Seattle, WA",WA,Seattle,...,44.47%,36.27%,12.09%,22842031.0,72.63%,56.25,8.20%,17.27%,17.13%,Male
4,3787,4,187000,165000,0,BLEND,7419,"San Francisco, CA",CA,San Francisco,...,-113.87%,-118.15%,52.02%,1863929.0,43.24%,56.52,0.06%,-5.65%,-0.47%,Male


In [570]:
public_companies.shape

(14496, 92)

In [571]:
public_companies = public_companies.drop(public_companies[public_companies['symbol'] == 'INTERNATIONAL'].index)

In [572]:
public_companies.shape

(14208, 92)

# Subsetting Columns for Investigation

In [574]:
for column in public_companies.columns:
    print(column)

Unnamed: 0.1
Unnamed: 0
total_yearly_compensation
base_salary
bonus
company
city_id
location
state
city
level
title
tag
years_of_experience
years_at_company
stock_grant_value
dmaid
row_number
race
education
symbol
company_name
market_cap
stock_price
% Change
industry
pe_ratio
Ent. Value
mc_group
sector
forward_pe
country
state_of_hq
employees
employee_change
employee_growth
revenue
rev_growth_3y
rev_growth_5y
rev_growth_yrs
gross_profit
gross_profit_growth
gross_profit_growth_quarter
gross_profit_growth_5Y
gross_profit_growth_3y
net_income
operating_income
operating_income_growth
operating_income_growth_quarter
operating_income_growth_3y
operating_income_growth_5y
net_income_growth
net_income_growth_quarter
net_income_growth_3y
net_income_growth_5y
net_income_growth_years
net_income_growth_quarters
total_cash
total_debt
revenue_per_employee
profit_per_employee
working_capital
equity
net_working_capital
revenue_growth_this_year
revenue_growth_next_5y
revenue_growth_next_year
revenue_gro

### Columns to be included
base_salary, years_of_experience, years_at_company, race, gender, education, industry, mc_group, sector, employees, state_of_hq, cash_over_market_cap, revenue_per_employee, profit_per_employee, debt_growth_year_over_year, rsi  

In [581]:
public_companies_subset = public_companies[['base_salary', 'years_of_experience', 'years_at_company',
                                            'education', 'race', 'gender', 'mc_group', 'sector', 'industry',
                                            'state_of_hq', 'employees', 'revenue_per_employee', 'profit_per_employee',
                                            'cash_over_market_cap', 'debt_growth_year_over_year', 'rsi']]

In [582]:
public_companies_subset.head()

Unnamed: 0,base_salary,years_of_experience,years_at_company,education,race,gender,mc_group,sector,industry,state_of_hq,employees,revenue_per_employee,profit_per_employee,cash_over_market_cap,debt_growth_year_over_year,rsi
0,210000,5,5,PhD,Asian,Male,Mega-Cap,Communication Services,Internet Content & Information,California,182381.0,1685450.0,404620.0,4.54%,-5.47%,64.81
1,124000,3,2,Bachelor's Degree,Two Or More,Male,Mega-Cap,Technology,Software - Infrastructure,Washington,221000.0,1029787.0,373489.0,-0.23%,47.01%,56.25
2,177000,6,6,Bachelor's Degree,Asian,Male,Mega-Cap,Communication Services,Internet Content & Information,California,182381.0,1685450.0,404620.0,4.54%,-5.47%,64.81
3,164000,4,4,Master's Degree,Asian,Male,Mega-Cap,Technology,Software - Infrastructure,Washington,221000.0,1029787.0,373489.0,-0.23%,47.01%,56.25
4,165000,5,0,Bachelor's Degree,White,Male,Small-Cap,Technology,Software - Application,California,881.0,178032.0,-210352.0,-1.06%,-36.23%,56.52


In [583]:
public_companies_subset.isna().sum()

base_salary                     0
years_of_experience             0
years_at_company                0
education                       0
race                            0
gender                          0
mc_group                        0
sector                          0
industry                        0
state_of_hq                   535
employees                      11
revenue_per_employee           32
profit_per_employee            22
cash_over_market_cap            0
debt_growth_year_over_year    144
rsi                             8
dtype: int64