# Capstone Project

## Exploring Gender Wage Disparities

In this notebook I explore the U.S. Bureau of Labor Statistics January 2015 report on income by gender and occupation.

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import cross_val_score, train_test_split
%matplotlib inline

Loading the data, cleaning it up, exploring it, and creating new features.

In [58]:
genderwage = pd.read_csv('/Users/Beba/Documents/JupyterNotebooks/CapstoneProject/inc_occ_gender.csv')
genderwage.head(30)

Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly
0,ALL OCCUPATIONS,109080,809,60746,895,48334,726
1,MANAGEMENT,12480,1351,7332,1486,5147,1139
2,Chief executives,1046,2041,763,2251,283,1836
3,General and operations managers,823,1260,621,1347,202,1002
4,Legislators,8,Na,5,Na,4,Na
5,Advertising and promotions managers,55,1050,29,Na,26,Na
6,Marketing and sales managers,948,1462,570,1603,378,1258
7,Public relations and fundraising managers,59,1557,24,Na,35,Na
8,Administrative services managers,170,1191,96,1451,73,981
9,Computer and information systems managers,636,1728,466,1817,169,1563


In [45]:
genderwage.describe()

Unnamed: 0,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly
count,558.0,322.0,558.0,232.0,558.0,192.0
mean,586.458781,910.052795,326.59319,1002.689655,259.831541,805.432292
std,4758.979138,353.261167,2654.600532,398.311869,2142.551053,305.503765
min,0.0,354.0,0.0,389.0,0.0,380.0
25%,21.0,626.0,11.0,678.75,3.0,533.0
50%,67.0,856.0,33.5,915.5,18.0,736.0
75%,253.0,1125.25,121.75,1265.25,84.0,988.5
max,109080.0,2041.0,60746.0,2251.0,48334.0,1836.0


In [29]:
genderwage.dtypes

Occupation     object
All_workers     int64
All_weekly     object
M_workers       int64
M_weekly       object
F_workers       int64
F_weekly       object
dtype: object

In [30]:
genderwage[['All_weekly',
            'M_weekly',
            'F_weekly']] = genderwage[['All_weekly',
                                       'M_weekly',
                                       'F_weekly']].apply(pd.to_numeric, errors='coerce')
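As a small illustration of what `errors='coerce'` does in this step, the sketch below applies the same conversion to three rows copied from the table above. The `sample` frame is only a stand-in for `genderwage`, not the full dataset.

```python
import pandas as pd

# Three rows copied from the table above; 'Na' marks a wage that was not reported.
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'Legislators', 'Marketing and sales managers'],
    'M_weekly': ['2251', 'Na', '1603'],
    'F_weekly': ['1836', 'Na', '1258'],
})

wage_cols = ['M_weekly', 'F_weekly']

# errors='coerce' turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
sample[wage_cols] = sample[wage_cols].apply(pd.to_numeric, errors='coerce')

print(sample.dtypes)
print(sample.isnull().sum())
```

Each 'Na' becomes NaN, which is why the wage columns end up as float64 rather than int64.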

In [31]:
genderwage.dtypes

Occupation      object
All_workers      int64
All_weekly     float64
M_workers        int64
M_weekly       float64
F_workers        int64
F_weekly       float64
dtype: object

In [32]:
genderwage.isnull().sum()

Occupation       0
All_workers      0
All_weekly     236
M_workers        0
M_weekly       326
F_workers        0
F_weekly       366
dtype: int64

In [59]:
list(genderwage['Occupation'].where(genderwage.isnull()==True))

['ALL OCCUPATIONS',
 'MANAGEMENT',
 'Chief executives',
 'General and operations managers',
 'Legislators',
 'Advertising and promotions managers',
 'Marketing and sales managers',
 'Public relations and fundraising managers',
 'Administrative services managers',
 'Computer and information systems managers',
 'Financial managers',
 'Compensation and benefits managers',
 'Human resources managers',
 'Training and development managers',
 'Industrial production managers',
 'Purchasing managers',
 'Transportation, storage, and distribution managers',
 'Farmers, ranchers, and other agricultural managers',
 'Construction managers',
 'Education administrators',
 'Architectural and engineering managers',
 'Food service managers',
 'Funeral service managers',
 'Gaming managers',
 'Lodging managers',
 'Medical and health services managers',
 'Natural sciences managers',
 'Postmasters and mail superintendents',
 'Property, real estate, and community association managers',
 'Social and community s
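The list above appears to contain every occupation, including rows such as 'Chief executives' that have no missing wages. Assuming the intent was to list only the occupations with at least one missing value, a minimal sketch of one way to do that is a row-wise mask built with `isnull().any(axis=1)`. The `sample` frame is a small stand-in built from rows shown earlier, and the names `has_missing` and `missing_occupations` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for genderwage after the numeric conversion (values from the table above).
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'Legislators',
                   'Advertising and promotions managers',
                   'Marketing and sales managers'],
    'All_weekly': [2041.0, np.nan, 1050.0, 1462.0],
    'M_weekly': [2251.0, np.nan, np.nan, 1603.0],
    'F_weekly': [1836.0, np.nan, np.nan, 1258.0],
})

# True for every row that has a NaN in at least one column.
has_missing = sample.isnull().any(axis=1)

# Keep only the occupation names of those rows.
missing_occupations = sample.loc[has_missing, 'Occupation'].tolist()
print(missing_occupations)
```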

In [47]:
cleanedgenderwage = genderwage.dropna(axis=0, how='any')

In [48]:
cleanedgenderwage.describe()

Unnamed: 0,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly
count,142.0,142.0,142.0,142.0,142.0,142.0
mean,2065.78169,921.098592,1122.471831,1017.535211,943.260563,827.471831
std,9298.336804,368.672668,5190.515099,411.970023,4182.313327,323.042686
min,108.0,391.0,53.0,401.0,50.0,380.0
25%,253.25,619.25,108.25,675.0,107.0,566.75
50%,560.5,898.5,278.0,992.5,201.5,773.5
75%,1294.5,1162.5,620.25,1343.75,551.25,1021.0
max,109080.0,2041.0,60746.0,2251.0,48334.0,1836.0
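Note that the maximum of 109080 workers in this summary is the 'ALL OCCUPATIONS' total from the first row of the data, so aggregate rows are still mixed in with individual occupations. One possible way to separate them is sketched below on a small stand-in frame, assuming that fully uppercase names (such as 'MANAGEMENT') always mark aggregate categories. The names `is_aggregate` and `occupations_only` are illustrative.

```python
import pandas as pd

# Stand-in rows copied from the first table in this notebook.
sample = pd.DataFrame({
    'Occupation': ['ALL OCCUPATIONS', 'MANAGEMENT', 'Chief executives',
                   'General and operations managers'],
    'All_workers': [109080, 12480, 1046, 823],
    'All_weekly': [809.0, 1351.0, 2041.0, 1260.0],
})

# Assumption: aggregate categories are written entirely in uppercase.
is_aggregate = sample['Occupation'].str.isupper()

# Keep only the rows that describe a single occupation.
occupations_only = sample[~is_aggregate]
print(occupations_only['All_workers'].max())
```

With the totals removed, summary statistics such as the mean and maximum describe individual occupations rather than category sums.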


In [49]:
list(cleanedgenderwage.Occupation)

['ALL OCCUPATIONS',
 'MANAGEMENT',
 'Chief executives',
 'General and operations managers',
 'Marketing and sales managers',
 'Administrative services managers',
 'Computer and information systems managers',
 'Financial managers',
 'Human resources managers',
 'Purchasing managers',
 'Transportation, storage, and distribution managers',
 'Education administrators',
 'Food service managers',
 'Lodging managers',
 'Medical and health services managers',
 'Property, real estate, and community association managers',
 'Social and community service managers',
 'Managers, all other',
 'BUSINESS',
 'Wholesale and retail buyers, except farm products',
 'Purchasing agents, except wholesale, retail, and farm products',
 'Claims adjusters, appraisers, examiners, and investigators',
 'Compliance officers',
 'Human resources workers',
 'Management analysts',
 'Market research analysts and marketing specialists',
 'Business operations specialists, all other',
 'Accountants and auditors',
 'Financial 

That is a lot of null values for a DataFrame with only 558 rows: dropping every row with a missing wage leaves just 142 of them.

Let's do some feature engineering and create some new columns to explore.

In [None]:
# make wage gap column
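A minimal sketch of one way this cell could be filled in. The notebook does not define "wage gap", so both definitions here are assumptions: the absolute difference in median weekly pay (`wage_gap`) and women's pay as a fraction of men's (`f_to_m_pay`). The `sample` frame stands in for `cleanedgenderwage` and uses rows from the table above.

```python
import pandas as pd

# Stand-in for cleanedgenderwage, using rows shown earlier in the notebook.
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'General and operations managers',
                   'Marketing and sales managers'],
    'M_weekly': [2251.0, 1347.0, 1603.0],
    'F_weekly': [1836.0, 1002.0, 1258.0],
})

# Assumed definition 1: dollars per week by which men's pay exceeds women's.
sample['wage_gap'] = sample['M_weekly'] - sample['F_weekly']

# Assumed definition 2: women's weekly pay as a fraction of men's.
sample['f_to_m_pay'] = sample['F_weekly'] / sample['M_weekly']

print(sample[['Occupation', 'wage_gap', 'f_to_m_pay']])
```

The ratio is easier to compare across occupations with very different pay levels, while the difference keeps the result in dollars.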

In [None]:
# make gender ratio column
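A minimal sketch of one way this cell could be filled in. "Gender ratio" is not defined in the notebook, so the definition used here, the share of workers who are women, is an assumption, as is the column name `share_female`. The denominator is `M_workers + F_workers` rather than `All_workers` because the rounded counts do not always add up exactly (for example, 466 + 169 versus 636 for computer and information systems managers).

```python
import pandas as pd

# Stand-in for genderwage, using rows shown earlier in the notebook.
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'General and operations managers',
                   'Legislators', 'Computer and information systems managers'],
    'M_workers': [763, 621, 5, 466],
    'F_workers': [283, 202, 4, 169],
})

# Assumed definition: fraction of workers in the occupation who are women.
total_workers = sample['M_workers'] + sample['F_workers']
sample['share_female'] = sample['F_workers'] / total_workers

print(sample[['Occupation', 'share_female']])
```

Occupations with zero reported workers of either gender would produce NaN here, so they may need to be dropped or handled separately.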

In [None]:
# make visual of incomes for both genders
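One possible visual for this cell, offered as a sketch rather than the intended design: a scatter plot of women's against men's median weekly pay with a dashed line marking equal pay. The `sample` frame is a stand-in for `cleanedgenderwage` built from rows shown earlier, and the choice of plot type is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for cleanedgenderwage, using rows shown earlier in the notebook.
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'General and operations managers',
                   'Marketing and sales managers',
                   'Administrative services managers',
                   'Computer and information systems managers'],
    'M_weekly': [2251.0, 1347.0, 1603.0, 1451.0, 1817.0],
    'F_weekly': [1836.0, 1002.0, 1258.0, 981.0, 1563.0],
})

fig, ax = plt.subplots(figsize=(5, 5))

# One point per occupation.
ax.scatter(sample['M_weekly'], sample['F_weekly'])

# Dashed reference line where women's pay would equal men's pay.
upper = sample[['M_weekly', 'F_weekly']].max().max() * 1.1
ax.plot([0, upper], [0, upper], linestyle='--', color='gray', label='equal pay')

ax.set_xlabel('Median weekly pay, men (USD)')
ax.set_ylabel('Median weekly pay, women (USD)')
ax.set_title('Weekly pay by gender and occupation')
ax.legend()
```

Points below the dashed line are occupations where women's median weekly pay is lower than men's.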

In [None]:
# make visual for gender ratio by occupation

In [None]:
# make visual for wage gap by occupation
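A sketch of how the last two cells might look, drawn side by side as horizontal bar charts that share the occupation axis. The definitions of `share_female` (women as a fraction of workers) and `wage_gap` (men's minus women's median weekly pay) are assumptions, and `sample` is a stand-in for `cleanedgenderwage` built from rows shown earlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for cleanedgenderwage, using rows shown earlier in the notebook.
sample = pd.DataFrame({
    'Occupation': ['Chief executives', 'General and operations managers',
                   'Marketing and sales managers',
                   'Administrative services managers',
                   'Computer and information systems managers'],
    'M_workers': [763, 621, 570, 96, 466],
    'F_workers': [283, 202, 378, 73, 169],
    'M_weekly': [2251.0, 1347.0, 1603.0, 1451.0, 1817.0],
    'F_weekly': [1836.0, 1002.0, 1258.0, 981.0, 1563.0],
})

# Assumed feature definitions (not specified in the notebook).
sample['share_female'] = sample['F_workers'] / (sample['M_workers'] + sample['F_workers'])
sample['wage_gap'] = sample['M_weekly'] - sample['F_weekly']

# Sort so the largest gap ends up at the top of the horizontal bars.
ordered = sample.sort_values('wage_gap')

fig, (ax_ratio, ax_gap) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

ax_ratio.barh(ordered['Occupation'], ordered['share_female'])
ax_ratio.set_xlabel('Share of workers who are women')

ax_gap.barh(ordered['Occupation'], ordered['wage_gap'])
ax_gap.set_xlabel('Men minus women, median weekly pay (USD)')

fig.tight_layout()
```

With all 142 rows of the cleaned data the occupation labels would likely overlap, so plotting only the largest and smallest gaps may be more readable.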