
<div align="center"><img width="375" height="35" src="https://t1nc.org/wp-content/uploads/2018/08/SAN-ANTONIO-THUMB-shutterstock_448844578-660--768x614.jpg" /> </div> 


<div align="center"> <h1>Individual Project - Predicting Gender based on Salary Data</h1> 
  <h6> by David Berchelmann -- April 9, 2021 </h6> </div>
  
  ------------------------------------------------


---

<h1> Welcome! </h1>

The following Jupyter notebook walks through my individual project on the relationship between gender and salary for the City of San Antonio. The dataset comes from data.world and can be accessed at https://data.world/amillerbernd/san-antonio-city-salary-data or as a CSV from my GitHub.

All of the files and notebooks for this project can be accessed via the GitHub repository located at --> https://github.com/DBerchelmann/employee-classification

For ease of reading, many of the large coding sections have been minimized to allow for a better scrolling experience. If you would like to enlarge a cell to see the data inside, please click on the three dots (<b>...</b>) for the specific cell. To reduce the cell, click the blue box to the left of the selected cell.

----

<a id='back'></a>
### Quick Links to Sections within this Notebook

- [Executive Summary](#BC)
- [Acquire Data](#AD)
- [Prepare Data](#PD)
- [Explore Data](#Explore)
- [Data Dictionary](#DY)
- [Hypothesis Testing](#Hypo)
- [Clustering](#CD)
- [Modeling](#Model)
- [Evaluate](#Eval)
- [Recommendations & Key Takeaways](#Conclusion)

<h1> Executive Summary </h1>

<a id='BC'></a>

[back to top](#back)

------

<h4><b>The Problem</b></h4>

- Is there a gender pay gap at the City of San Antonio?

<h4><b>The Goal</b></h4>

- Use classification to determine if gender can be predicted using salary data from fiscal year 2016

<h4><b>The Process</b></h4>

  * Acquire the Data
  * Prepare 
  * Explore 
  * Model
  * Create Recommendations Based On Findings 
  
<h4><b>The Findings</b></h4>

- Exploration revealed that there is definitely a gender pay gap
- My Random Forest classification model correctly predicted gender 76% of the time, beating the baseline of 65%
- Not only is there a gender pay gap, but there is also a discrepancy in pay by ethnicity
- More in-depth analysis needs to be done
- Modeling can be further refined by splitting up the salaries by department and investigating pay gap discrepancies
     - <i>Baseline Accuracy --> 65% </i>
     - <b>Random Forest Accuracy on out of sample test data --> 76%</b>
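
To make the baseline comparison concrete, here is a minimal sketch of how a baseline accuracy is commonly computed. It assumes the baseline means "always predict the most common gender", which the notebook does not spell out here. The helper name `baseline_accuracy` and the toy labels are hypothetical and are not taken from the project files.

```python
import pandas as pd


def baseline_accuracy(target: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    # value_counts(normalize=True) gives each class's share of the rows;
    # the largest share is the accuracy of a constant majority-class guess.
    return float(target.value_counts(normalize=True).max())


# Toy labels invented for illustration only (not the city salary data).
toy_gender = pd.Series(["MALE"] * 13 + ["FEMALE"] * 7)
toy_baseline = baseline_accuracy(toy_gender)
print(f"Baseline accuracy on the toy labels: {toy_baseline:.2f}")
```

A model is only useful for this question if its accuracy on unseen data is clearly above this number.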

    
    


-------


-----
<h3> Environment Setup</h3>

----

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
from graphviz import Graph
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import plotly.express as px
from datetime import date 
from wrangle import new_city_data, clean_city, missing_zero_values_table, train_validate_test_split
import explore

from model import run_model

<h4> Data Validation </h4>

 - Before the data was brought in through wrangle.py, I reviewed it in Excel. Below are a few of the findings:
     - The hire date column needed to be converted to a readable format
     - The columns with varying numbers of decimal places could be cleaned up
     - There was also an opportunity to create more categorical features by splitting some of the data up
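
The three findings above map onto a few standard pandas operations. The sketch below is only an illustration on invented rows: the column names (`hire_date`, `annual_salary`), the date format, and the salary bands are all assumptions, and the project's actual cleaning lives in wrangle.py.

```python
import pandas as pd

# Toy rows with hypothetical column names and values.
toy = pd.DataFrame({
    "hire_date": ["3/14/2005", "11/2/2015", "7/30/1998"],
    "annual_salary": [51665.146688, 32607.38, 65155.4912],
})

# 1. Parse the hire date strings into proper datetimes.
toy["hire_date"] = pd.to_datetime(toy["hire_date"], format="%m/%d/%Y")

# 2. Round monetary values to a consistent two decimal places.
toy["annual_salary"] = toy["annual_salary"].round(2)

# 3. Derive categorical features by splitting existing columns up:
#    a salary band from the numeric salary, and a hire year from the date.
toy["salary_band"] = pd.cut(
    toy["annual_salary"],
    bins=[0, 40_000, 60_000, float("inf")],
    labels=["under_40k", "40k_to_60k", "over_60k"],
)
toy["hire_year"] = toy["hire_date"].dt.year

print(toy)
```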

---
<h3><u>Acquire the Data</u></h3>

----

<a id='AD'></a>

[back to top](#back)

In [4]:
df = new_city_data()

In [6]:
print(f'Our original dataframe is coming in with {df.shape[0]} rows and {df.shape[1]} columns.')

Our original dataframe is coming in with 11923 rows and 17 columns.


In [7]:
df.describe()

| | FY16 ANNUAL SALARY2 | FY16 BASE PAY3 | FY16 LEAVE PAYOUT4 | FY16 OTHER5 | FY16 OVERTIME6 | FY16 GROSS EARNINGS7 | FY16 ADDITIONAL BENEFITS8 | FY16 TOTAL COMPENSATION9 |
|---|---|---|---|---|---|---|---|---|
| count | 11923.0 | 11923.0 | 11923.0 | 11923.0 | 11923.0 | 11923.0 | 11923.0 | 11923.0 |
| mean | 51665.146688 | 46521.141886 | 1592.511874 | 5246.367811 | 4124.500218 | 57484.521789 | 24065.96789 | 81550.48968 |
| std | 22426.198015 | 26318.791088 | 2211.263406 | 7549.287121 | 8160.922331 | 36791.795606 | 16849.396949 | 52549.340694 |
| min | 18200.0 | 0.0 | 0.0 | -100.0 | -239.45 | 0.0 | 0.0 | 0.0 |
| 25% | 32607.38 | 30025.24 | 0.0 | 53.08 | 0.0 | 32118.335 | 14187.673405 | 46409.635645 |
| 50% | 49188.1 | 46419.88 | 660.3 | 825.42 | 356.39 | 51073.58 | 17811.97902 | 68812.36367 |
| 75% | 65155.49 | 64428.0 | 2220.64 | 9805.38 | 4713.355 | 85713.08 | 44431.45189 | 128011.5982 |
| max | 425000.0 | 414615.38 | 16947.96 | 97354.89 | 68212.29 | 511970.27 | 75379.48 | 587349.75 |


In [8]:
missing_zero_values_table(df)

Your selected dataframe has 17 columns and 11923 Rows.
There are 1 columns that have NULL values.


| | Zero Values | null_count | % of Total Values | Total Zeroes + Null Values | % Total Zero + Null Values | Data Type |
|---|---|---|---|---|---|---|
| MIDDLE NAME | 0 | 5662 | 47.5 | 5662 | 47.5 | object |
| FIRST NAME | 0 | 0 | 0.0 | 0 | 0.0 | object |
| FY16 GROSS EARNINGS7 | 32 | 0 | 0.0 | 32 | 0.3 | float64 |
| ETHNIC ORIGIN10 | 0 | 0 | 0.0 | 0 | 0.0 | object |
| GENDER | 0 | 0 | 0.0 | 0 | 0.0 | object |
| BUSINESS AREA | 0 | 0 | 0.0 | 0 | 0.0 | object |
| JOB TITLE | 0 | 0 | 0.0 | 0 | 0.0 | object |
| FY16 TOTAL COMPENSATION9 | 15 | 0 | 0.0 | 15 | 0.1 | float64 |
| FY16 ADDITIONAL BENEFITS8 | 18 | 0 | 0.0 | 18 | 0.2 | float64 |
| FY16 OVERTIME6 | 4562 | 0 | 0.0 | 4562 | 38.3 | float64 |
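
For readers without access to wrangle.py, a summary like the one above can be approximated in a few lines of pandas. This is a hypothetical stand-in, not the project's `missing_zero_values_table`; the function name `zero_null_summary`, its column labels, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def zero_null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count zeros and nulls per column (illustrative stand-in only)."""
    # Zeros are only meaningful for numeric columns; other columns get 0.
    zeros = (
        df.select_dtypes(include="number")
        .eq(0)
        .sum()
        .reindex(df.columns, fill_value=0)
    )
    nulls = df.isnull().sum()
    summary = pd.DataFrame({
        "zero_values": zeros,
        "null_count": nulls,
        "pct_null": (100 * nulls / len(df)).round(1),
        "zero_plus_null": zeros + nulls,
        "pct_zero_plus_null": (100 * (zeros + nulls) / len(df)).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Put the columns with the most zero or missing values first.
    return summary.sort_values("pct_zero_plus_null", ascending=False)


# Toy frame invented for illustration only.
toy = pd.DataFrame({
    "middle_name": ["A", None, None, "B"],
    "overtime": [0.0, 125.5, 0.0, 40.0],
    "gender": ["MALE", "FEMALE", "FEMALE", "MALE"],
})
summary = zero_null_summary(toy)
print(summary)
```

Separating zeros from nulls matters here: a zero in an overtime column is a valid value, while a null middle name is simply missing.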


-----
<h3><u> Initial Thoughts</u> </h3>

- The data set needs to be cleaned up before any exploration
- Rename columns for readability
- Clean up the numeric values (for example, inconsistent decimal places)
- Drop the first, middle, and last name columns
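
As a rough sketch of these cleanup steps, the snippet below renames columns, rounds a salary value, and drops the name columns on an invented one-row frame. The helper `tidy_column` is hypothetical, the `LAST NAME` column is assumed to exist (it does not appear in the output above), and the project's real logic is in `clean_city` from wrangle.py, which may differ.

```python
import re

import pandas as pd


def tidy_column(name: str) -> str:
    """Drop a trailing footnote number and convert to snake_case."""
    # "FY16 ANNUAL SALARY2" -> "FY16 ANNUAL SALARY": only digits that
    # directly follow a letter at the very end of the name are removed.
    name = re.sub(r"(?<=[A-Za-z])\d+$", "", name.strip())
    return re.sub(r"\s+", "_", name).lower()


# One toy row, invented for illustration only.
toy = pd.DataFrame({
    "FIRST NAME": ["ANA"],
    "MIDDLE NAME": [None],
    "LAST NAME": ["DOE"],  # assumed column name
    "GENDER": ["FEMALE"],
    "FY16 ANNUAL SALARY2": [51665.146688],
    "ETHNIC ORIGIN10": ["HISPANIC"],
})

cleaned = (
    toy.rename(columns=tidy_column)
    .drop(columns=["first_name", "middle_name", "last_name"])
)
cleaned["fy16_annual_salary"] = cleaned["fy16_annual_salary"].round(2)
print(cleaned.columns.tolist())
```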



----