# Houston, we have a Salary!

### Executive Summary
The goal of this analysis is to 

### Project Overview

### Key Takeaways

### Data Dictionary

## Imports used throughout this notebook



In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

# custom module imports
import Acquire as aq
import Prepare as pr
# import explore as ex

# feature selection imports
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE

# scaling, feature-expansion and data-splitting imports
from sklearn.preprocessing import RobustScaler, StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split

# statistical testing
from scipy import stats

# modeling and evaluation imports
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.metrics import mean_squared_error, explained_variance_score

# import to remove warnings
import warnings
warnings.filterwarnings("ignore")

## Acquire 

The dataset was acquired from the [Texas Tribune's website](https://salaries.texastribune.org/) by downloading all of the data as a CSV file. We read the CSV into a Pandas dataframe using functions from the Acquire script.

The Acquire script contains functions to obtain the data, cache it, and provide summary statistics.
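
The caching step described above could look roughly like the sketch below. The implementation of `aq.get_texas_data` is not shown in this notebook, so the helper name `read_with_cache` and its parameters are hypothetical and only illustrate the read-once, reuse-afterwards pattern.

```python
import os
import pandas as pd


def read_with_cache(source, cache_path):
    """Return the data as a DataFrame, reading a local cache when one exists.

    Hypothetical helper: `source` is any path or URL that pd.read_csv accepts,
    and `cache_path` is where a local copy is stored after the first read.
    """
    if os.path.exists(cache_path):
        # Cache hit: skip the original source entirely
        return pd.read_csv(cache_path)

    df = pd.read_csv(source)
    # index=False keeps the row index from being written as an extra
    # "Unnamed: 0" column when the cache is read back later
    df.to_csv(cache_path, index=False)
    return df
```

The first call reads from `source` and writes the cache; later calls only touch `cache_path`.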

In [2]:
# read data as df
df = aq.get_texas_data()

In [3]:
# quick look at data 
df.head()

| | AGY | NAME | LASTNAME | FIRSTNAME | MI | JOBCLASS | JC TITLE | RACE | SEX | EMPTYPE | ... | RATE | HRSWKD | MONTHLY | ANNUAL | STATENUM | duplicated | multiple_full_time_jobs | combined_multiple_jobs | summed_annual_salary | hide_from_search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | SENATE ... | GILLIAM | STACEY | L | 7101 | LEG. OFFICIAL/ADMINISTRATOR ... | WHITE | FEMALE | URP - UNCLASSIFIED REGULAR PART-TIME | ... | 0.0 | 20.0 | 8100.0 | 97200.0 | 339371 | True | | | 181200.0 | |
| 1 | 104 | LEGISLATIVE BUDGET BOARD ... | GILLIAM | STACEY | L | C160 | COMMITTEE DIRECTOR ... | WHITE | FEMALE | URP - UNCLASSIFIED REGULAR PART-TIME | ... | 0.0 | 20.0 | 7000.0 | 84000.0 | 339371 | True | | | | True |
| 2 | 101 | SENATE ... | NELSON | DAVID | | 7101 | LEG. OFFICIAL/ADMINISTRATOR ... | WHITE | MALE | URP - UNCLASSIFIED REGULAR PART-TIME | ... | 0.0 | 20.0 | 9500.0 | 114000.0 | 193187 | True | | | 210000.0 | |
| 3 | 104 | LEGISLATIVE BUDGET BOARD ... | NELSON | DAVID | | P080 | SENIOR BUDGET ADVISOR ... | WHITE | MALE | URP - UNCLASSIFIED REGULAR PART-TIME | ... | 0.0 | 20.0 | 8000.0 | 96000.0 | 193187 | True | | | | True |
| 4 | 101 | SENATE ... | ROCHA | MARIE | S | 7103 | LEG. SERVICE/MAINTENANCE ... | HISPANIC | FEMALE | URF - UNCLASSIFIED REGULAR FULL-TIME | ... | 0.0 | 41.0 | 3365.4 | 40384.8 | 152257 | True | | True | | |


In [3]:
# look at all columns in df
df.columns

Index(['AGY', 'NAME', 'LASTNAME', 'FIRSTNAME', 'MI', 'JOBCLASS', 'JC TITLE',
       'RACE', 'SEX', 'EMPTYPE', 'HIREDT', 'RATE', 'HRSWKD', 'MONTHLY',
       'ANNUAL', 'STATENUM', 'duplicated', 'multiple_full_time_jobs',
       'combined_multiple_jobs', 'summed_annual_salary', 'hide_from_search'],
      dtype='object')

In [4]:
# Check the different classes in each variable
df.nunique()

AGY                           111
NAME                          111
LASTNAME                    38227
FIRSTNAME                   23267
MI                             28
JOBCLASS                     1474
JC TITLE                     1406
RACE                            6
SEX                             2
EMPTYPE                         9
HIREDT                       6257
RATE                          207
HRSWKD                         47
MONTHLY                     39035
ANNUAL                      39037
STATENUM                   144727
duplicated                      1
multiple_full_time_jobs         1
combined_multiple_jobs          1
summed_annual_salary           11
hide_from_search                1
dtype: int64

- Displaying `value_counts` for the following variables will be helpful, as each has only a few classes:
    - `RACE`
    - `SEX`
    - `EMPTYPE`
- Most of the remaining variables are continuous or have many distinct values.
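
As a rough sketch of that check, the snippet below computes class proportions for the three low-cardinality columns. It uses a small toy dataframe called `sample` (illustrative values only, not the real data); on the notebook's data the same loop would run against `df`.

```python
import pandas as pd

# Toy data standing in for the full dataframe (values are illustrative only)
sample = pd.DataFrame({
    "RACE": ["WHITE", "WHITE", "HISPANIC", "WHITE"],
    "SEX": ["FEMALE", "MALE", "FEMALE", "FEMALE"],
    "EMPTYPE": ["URP", "URP", "URF", "URF"],
})

low_cardinality = ["RACE", "SEX", "EMPTYPE"]

# Share of rows falling in each class, per column
proportions = {
    col: sample[col].value_counts(normalize=True) for col in low_cardinality
}

for col, shares in proportions.items():
    print(f"--- {col} ---")
    print(shares)
```

Passing `normalize=True` returns proportions rather than raw counts, which makes class imbalance easier to compare across columns.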

In [3]:
# summarize the df and its columns
aq.get_data_summary(df)

The dataframe has 144738 rows and 21 columns.

-------------------
There are total of 723537 missing values in the entire dataframe.

-------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144738 entries, 0 to 144737
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   AGY                      144738 non-null  int64  
 1   NAME                     144738 non-null  object 
 2   LASTNAME                 144738 non-null  object 
 3   FIRSTNAME                144738 non-null  object 
 4   MI                       144738 non-null  object 
 5   JOBCLASS                 144738 non-null  object 
 6   JC TITLE                 144738 non-null  object 
 7   RACE                     144738 non-null  object 
 8   SEX                      144738 non-null  object 
 9   EMPTYPE                  144738 non-null  object 
 10  HIREDT                   144738 non-null  object 
 11  RATE           

## Takeaways

- The dataframe has 144,738 observations and 21 columns.
- Annual salary (`ANNUAL`) is our target variable.
- The columns `duplicated`, `multiple_full_time_jobs`, `combined_multiple_jobs`, `summed_annual_salary` and `hide_from_search` contain very few non-null values. No other column has null values.
- Each observation is an employee of the Texas state government.
- The `AGY` column is the ID of the department the employee works for.
- Most column names are upper case and should be converted to lower case.
- The column `MONTHLY` is the monthly salary.
    - The target variable `ANNUAL` is derived from this column (in the rows shown by `df.head()`, `ANNUAL` equals 12 × `MONTHLY`).
- Most variables are object data types and will need to be one-hot encoded before they can be used for modeling.
- The highest annual salary is 553,500 USD and the lowest is 600 USD.
- Hours worked (`HRSWKD`) range from 2 to 80.
- White employees make up the largest share of the dataset and Native American employees the smallest.
- There are more females than males in the dataset.
- Most employees are classified regular full-time.
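
Several of these takeaways translate directly into preparation steps. The sketch below shows one possible way to lower-case the column names, confirm the `ANNUAL` to `MONTHLY` relationship, and one-hot encode object columns. It runs on a toy dataframe modeled on the `df.head()` output and is not the notebook's actual Prepare script; the variable names `sample`, `annual_matches` and `encoded` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the df.head() output (illustrative only)
sample = pd.DataFrame({
    "RACE": ["WHITE", "WHITE", "HISPANIC"],
    "SEX": ["FEMALE", "MALE", "FEMALE"],
    "MONTHLY": [8100.0, 9500.0, 3365.4],
    "ANNUAL": [97200.0, 114000.0, 40384.8],
})

# 1. Lower-case the column names
sample.columns = sample.columns.str.lower()

# 2. Check that annual salary is twelve times the monthly salary;
#    np.isclose tolerates floating-point rounding noise
annual_matches = np.isclose(sample["monthly"] * 12, sample["annual"]).all()

# 3. One-hot encode the object columns so models can use them
encoded = pd.get_dummies(sample, columns=["race", "sex"])

print(annual_matches)
print(encoded.columns.tolist())
```

If `annual_matches` holds on the full dataset, `monthly` carries the same information as the target and would leak it if kept as a model feature.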