# Scraping the Bureau of Labor Statistics (BLS)

### Mining data from the web with Python

My aunt recently asked me for a website to compare salaries for different occupations across the country. I didn't know of one, but knew that the BLS likely had the best and most reliable data. The BLS has a data API, but the documentation sucks. I wanted the data but didn't want to mess with their API, so I decided to scrape the data (604 webpages) instead. Practice makes perfect!

In [1]:
# Imports

from __future__ import division

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup
from collections import defaultdict

import pandas_profiling as pdpf
import scipy.stats as sts
import seaborn as sns
import numpy as np
import sqlite3
import math
import time
import copy
import sys
import re
import os

# Settings

import warnings
warnings.filterwarnings('ignore')

sys.path.extend([r'C:\Users\michael\Documents\_python\modules'])
import data_science_tools as dst
import data_visualization_tools as vst

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

# main db
main_db = r'C:\Users\michael\Documents\_databases\master.db'

The BLS (https://www.bls.gov/) publishes (for now) a bunch of data about labor. Of particular interest here are the tables of employment and wage data listed by metropolitan and nonmetropolitan area, which are useful for answering the general question:

'How much should I expect to be paid to do x [occupation] in y [geographic area]?'

In [2]:
# Get connection to bls.gov
try:
    html = urlopen(r'https://www.bls.gov/oes/current/oessrcma.htm')
except Exception:
    print('error opening url')

# Parse the raw html
try:
    soup = BeautifulSoup(html, 'html.parser')
except Exception:
    print('error reading html')

Scrape the main page and get a list of all the metro/non-metro areas and their complete links.

In [3]:
# Find all hyperlinks in html text
main_links = [(link, re.sub('_+', '_', re.sub('[^a-zA-Z]', '_', link.text)))
              for link in soup.find_all('a', href=True) 
              if link['href'].startswith('oes')]

# No duplicates
main_links = list(set(main_links))

# the root url
root_url = r'https://www.bls.gov/oes/current/'

# join root url and links
complete_links = [(root_url + link['href'], name) for link, name in main_links]

# Filter links - do this??

# These are the target tables
print('{} target tables identified'.format(len(complete_links)))

604 target tables identified
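The link-harvesting step can also be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: it swaps BeautifulSoup for `html.parser.HTMLParser`, runs on a made-up HTML fragment (the `href` values are hypothetical), and uses Python 3 syntax. The name-cleaning regex is the same two-step substitution used in the cell above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

ROOT_URL = 'https://www.bls.gov/oes/current/'

# Made-up fragment standing in for the real index page; hrefs are hypothetical
SAMPLE_HTML = '''
<ul>
  <li><a href="oes_0000001.htm">Olympia-Tumwater, WA</a></li>
  <li><a href="oes_0000002.htm">Fort Collins, CO</a></li>
  <li><a href="/help/">Help</a></li>
</ul>
'''


def clean_name(text):
    # Non-letters become '_', then runs of '_' are collapsed to one
    return re.sub('_+', '_', re.sub('[^a-zA-Z]', '_', text))


class OesLinkParser(HTMLParser):
    """Collects (absolute_url, clean_name) for anchors whose href starts with 'oes'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.startswith('oes'):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only collect text while inside a matching anchor
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            name = clean_name(''.join(self._text))
            self.links.append((urljoin(ROOT_URL, self._href), name))
            self._href = None


parser = OesLinkParser()
parser.feed(SAMPLE_HTML)
for url, name in parser.links:
    print(name, url)
```

`urljoin` does the same job as concatenating `root_url + link['href']`, but it also copes with absolute or parent-relative hrefs if the page ever contains them.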


These are the helper functions that search a webpage's html, find the table rows, and evaluate them to determine if they contain the data of interest. In this case, the relevant data rows have 11 fields, and the first field is always an occupation code in the format dd-dddd.

In [4]:
def is_salary_data(table_row):
    # Determines if a table_row is a salary entry
    fields = table_row.find_all("td")
    # must have 11 fields and start with dd-dddd @ field 0
    return len(fields) == 11 and bool(re.match(r'\d\d-\d\d\d\d', fields[0].text))

def get_salaries(soup):
    # Reads the html and returns the salary rows as lists of cell text
    salaries = []
    all_rows_in_page = soup.find_all("tr")
    for row in all_rows_in_page:
        # Keep only the rows that look like salary data
        if is_salary_data(row):
            # Extract the text of each cell
            row_text = [td.text for td in row.find_all("td")]
            # Stack the row
            salaries.append(row_text)
    return salaries

cols = ['Occupation_code', 'Occupation_title', 'Level', 'Employment',
        'Employment_RSE', 'Employment_per_1k_jobs', 'Location_quotient',
        'Median_hourly_wage', 'Mean_hourly_wage', 'Annual_mean_wage', 
        'Mean_wage_RSE']
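To see the row test in isolation, here is a small sketch that applies the same two conditions (11 fields, first field shaped like dd-dddd) to plain lists of cell text. `looks_like_salary_row` is a hypothetical stand-in for `is_salary_data` that skips the HTML parsing; the header row and the malformed row are made up for illustration. Python 3 syntax.

```python
import re

CODE_PATTERN = re.compile(r'\d\d-\d\d\d\d')


def looks_like_salary_row(cells):
    # Hypothetical stand-in for is_salary_data that takes a list of cell strings
    return len(cells) == 11 and CODE_PATTERN.match(cells[0]) is not None


data_row = ['11-1011', 'Chief Executives', 'detail', '70', '10.7%', '1.074',
            '0.68', '$42.34', '$42.09', '$87,550', '4.1%']
header_row = ['Occupation code', 'Occupation title', 'Level']  # made-up short row
odd_row = ['11-10119'] + data_row[1:]                          # made-up bad code

print(looks_like_salary_row(data_row))    # True
print(looks_like_salary_row(header_row))  # False
print(looks_like_salary_row(odd_row))     # True: match() only anchors the start
print(CODE_PATTERN.fullmatch(odd_row[0]) is None)  # True: fullmatch() rejects it
```

The last two lines show a design choice worth knowing about: `re.match` accepts anything that merely starts with the pattern. That is probably fine for these pages, but `fullmatch` is the stricter option if stray rows ever slip through.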

Now that we have a list of links, we need to iterate each one, request the page html, use the helper functions to ID the target data and collect it.

Once we start collecting the relevant data, we have to do something with it. We could simply write all the data to a .csv or .xlsx file (or several, e.g. one per area), but since we are scraping 600+ pages, we don't want to juggle that many tables for aggregate analysis. Instead, we should aggregate everything now (technically an append operation) and key each page's data with a location field. Given the size of the resulting table (an estimated 300 rows per table x 600 tables = 180,000 rows), we're gonna want the ability to query the data with SQL. Enter SQLite: a lightweight, super useful, on-disk relational database.

sqlite3 is the Python API for SQLite; it is part of the standard library. 
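As a rough sketch of the "append everything into one table, tagged by location" idea, the snippet below uses `sqlite3` directly on an in-memory database. The three-column table layout and the `pages` dictionary are simplifications assumed for illustration; the notebook itself appends full DataFrames with `to_sql`. Python 3 syntax.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, nothing written to disk
conn.execute(
    'CREATE TABLE master '
    '(Occupation_title TEXT, Annual_mean_wage INTEGER, Location TEXT)')

# Two tiny stand-in "pages"; a real page has hundreds of rows
pages = {
    'Pueblo_CO': [('Pharmacists', 123390)],
    'Napa_CA': [('Sales Managers', 147570), ('Lawyers', 149040)],
}

for location, rows in pages.items():
    # Tag every row with its source location, then append to the one shared table
    conn.executemany(
        'INSERT INTO master VALUES (?, ?, ?)',
        [(title, wage, location) for title, wage in rows])
conn.commit()

total = conn.execute('SELECT COUNT(*) FROM master').fetchone()[0]
napa_rows = conn.execute(
    'SELECT Occupation_title, Annual_mean_wage FROM master '
    'WHERE Location = ? ORDER BY Annual_mean_wage DESC', ('Napa_CA',)).fetchall()
print(total)
print(napa_rows)
conn.close()
```

Because every row carries its `Location`, one table can answer both "all occupations in one area" and "one occupation across all areas" with a simple `WHERE` clause.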

In [5]:
# Build a master table
conn = sqlite3.connect(main_db)

# Keep track of the errors
misses = []

n = len(complete_links)
i = 0

# Work through the list of links:
#    Get the html
#    Find the table rows
#    Determine if target data
#    Collect (or not)
#    Add the location (source table id) tag
#    Append to master table

for link in complete_links:
    try:
        html = urlopen(link[0])
    except Exception:
        print('error opening url - {:.40}'.format(link[1]))
        misses.append(link)
        continue

    try:
        soup = BeautifulSoup(html, 'html.parser')
        df = pd.DataFrame(get_salaries(soup), columns=cols)
        df['Location'] = link[1]
        df.to_sql('master', conn, index=True, if_exists="append")
        i += 1
        print('scraped {} of {} tables: {:.40}'.format(i, n, link[1]))
        time.sleep(1)
    except Exception:
        print('error reading html - {:.40}'.format(link[1]))
        misses.append(link)

print('Successfully scraped {} tables'.format(i))
        
conn.close()

# This will take about 15 minutes to get all the data

scraped 1 of 604 tables: Eastern_and_Southern_Colorado_nonmetropo
scraped 2 of 604 tables: Southwest_Alabama_nonmetropolitan_area
scraped 3 of 604 tables: Northern_New_Hampshire_nonmetropolitan_a
scraped 4 of 604 tables: Olympia_Tumwater_WA
scraped 5 of 604 tables: New_Haven_CT
scraped 6 of 604 tables: Gainesville_FL
scraped 7 of 604 tables: Battle_Creek_MI
scraped 8 of 604 tables: State_College_PA
scraped 9 of 604 tables: Fort_Collins_CO
scraped 10 of 604 tables: Southwest_Kansas_nonmetropolitan_area
scraped 11 of 604 tables: Champaign_Urbana_IL
scraped 12 of 604 tables: Barnstable_Town_MA
scraped 13 of 604 tables: Carbondale_Marion_IL
scraped 14 of 604 tables: Central_New_York_nonmetropolitan_area
scraped 15 of 604 tables: Corvallis_OR
scraped 16 of 604 tables: Portland_Vancouver_Hillsboro_OR_WA
scraped 17 of 604 tables: Northeast_Louisiana_nonmetropolitan_area
scraped 18 of 604 tables: Raleigh_NC
scraped 19 of 604 tables: Boston_Cambridge_Nashua_MA_NH
scraped 20 of 604 tables: Flint

scraped 159 of 604 tables: Yakima_WA
scraped 160 of 604 tables: Northeast_Alabama_nonmetropolitan_area
scraped 161 of 604 tables: Central_Missouri_nonmetropolitan_area
scraped 162 of 604 tables: Central_New_Hampshire_nonmetropolitan_ar
scraped 163 of 604 tables: Peoria_IL
scraped 164 of 604 tables: Monroe_LA
scraped 165 of 604 tables: Balance_of_Lower_Peninsula_of_Michigan_n
scraped 166 of 604 tables: Atlanta_Sandy_Springs_Roswell_GA
scraped 167 of 604 tables: Northeast_Oklahoma_nonmetropolitan_area
scraped 168 of 604 tables: Columbus_IN
scraped 169 of 604 tables: Owensboro_KY
scraped 170 of 604 tables: Kankakee_IL
scraped 171 of 604 tables: West_Puerto_Rico_nonmetropolitan_area
scraped 172 of 604 tables: Lynn_Saugus_Marblehead_MA_NECTA_Division
scraped 173 of 604 tables: Miami_Miami_Beach_Kendall_FL_Metropolita
scraped 174 of 604 tables: Connecticut_nonmetropolitan_area
scraped 175 of 604 tables: Merced_CA
scraped 176 of 604 tables: Central_Washington_nonmetropolitan_area
scraped 177 

scraped 316 of 604 tables: Kalamazoo_Portage_MI
scraped 317 of 604 tables: Santa_Fe_NM
scraped 318 of 604 tables: Miami_Fort_Lauderdale_West_Palm_Beach_FL
scraped 319 of 604 tables: Charleston_North_Charleston_SC
scraped 320 of 604 tables: Waterbury_CT
scraped 321 of 604 tables: Beaumont_Port_Arthur_TX
scraped 322 of 604 tables: North_Coast_Region_of_California_nonmetr
scraped 323 of 604 tables: Casper_WY
scraped 324 of 604 tables: Augusta_Richmond_County_GA_SC
scraped 325 of 604 tables: Northern_West_Virginia_nonmetropolitan_a
scraped 326 of 604 tables: South_Central_Tennessee_nonmetropolitan_
scraped 327 of 604 tables: Sumter_SC
scraped 328 of 604 tables: Burlington_NC
scraped 329 of 604 tables: Columbus_OH
scraped 330 of 604 tables: North_Nevada_nonmetropolitan_area
scraped 331 of 604 tables: College_Station_Bryan_TX
scraped 332 of 604 tables: Southwest_New_York_nonmetropolitan_area
scraped 333 of 604 tables: Oshkosh_Neenah_WI
scraped 334 of 604 tables: Cumberland_MD_WV
scraped 335 

scraped 471 of 604 tables: Phoenix_Mesa_Scottsdale_AZ
scraped 472 of 604 tables: San_Jose_Sunnyvale_Santa_Clara_CA
scraped 473 of 604 tables: Morristown_TN
scraped 474 of 604 tables: Tucson_AZ
scraped 475 of 604 tables: Greenville_Anderson_Mauldin_SC
scraped 476 of 604 tables: Jackson_MS
scraped 477 of 604 tables: Abilene_TX
scraped 478 of 604 tables: Portland_South_Portland_ME
scraped 479 of 604 tables: Laredo_TX
scraped 480 of 604 tables: Virginia_Beach_Norfolk_Newport_News_VA_N
scraped 481 of 604 tables: Kingston_NY
scraped 482 of 604 tables: Visalia_Porterville_CA
scraped 483 of 604 tables: South_Bend_Mishawaka_IN_MI
scraped 484 of 604 tables: Southeast_Minnesota_nonmetropolitan_area
scraped 485 of 604 tables: Central_Southeast_Wyoming_nonmetropolita
scraped 486 of 604 tables: Gainesville_GA
scraped 487 of 604 tables: Yuma_AZ
scraped 488 of 604 tables: Middle_Georgia_nonmetropolitan_area
scraped 489 of 604 tables: Omaha_Council_Bluffs_NE_IA
scraped 490 of 604 tables: West_Kentucky_
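The scraping loop records failed pages in `misses` but never revisits them. One possible follow-up, sketched here under assumptions, is to make a few passes and re-queue whatever failed. `scrape_with_retries` and `fake_fetch` are hypothetical names, the URLs are placeholders, and the fetcher is simulated so the example runs without a network. Python 3 syntax.

```python
import time


def scrape_with_retries(links, fetch, max_passes=3, delay=0.0):
    """Call fetch(url) for each (url, name); re-queue failures for another pass."""
    done, pending = {}, list(links)
    for _ in range(max_passes):
        misses = []
        for url, name in pending:
            try:
                done[name] = fetch(url)
            except IOError:
                misses.append((url, name))
            time.sleep(delay)  # the notebook pauses 1 second between requests
        pending = misses
        if not pending:
            break
    return done, pending


# Simulated fetcher: the 'flaky' URL fails on its first call, then succeeds
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if 'flaky' in url and calls[url] == 1:
        raise IOError('simulated network error')
    return '<html>{}</html>'.format(url)


links = [('https://example.invalid/ok', 'Ok_Area'),
         ('https://example.invalid/flaky', 'Flaky_Area')]
done, still_missing = scrape_with_retries(links, fake_fetch)
print(sorted(done), still_missing)
```

Anything left in `still_missing` after the last pass is worth inspecting by hand rather than retrying forever.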

OK, time to reel it in and see what we caught! 

Let's pull out the entire scraped dataset and get its shape (as n rows, n cols).

In [5]:
# Get the whole master table
conn = sqlite3.connect(main_db)

master_df = pd.read_sql_query(
    "SELECT * FROM master;", conn)
master_df.shape

conn.close()

(239737, 13)

Almost 240k rows!

Here's what the data looks like:

In [6]:
master_df.head(20)

Unnamed: 0,index,Occupation_code,Occupation_title,Level,Employment,Employment_RSE,Employment_per_1k_jobs,Location_quotient,Median_hourly_wage,Mean_hourly_wage,Annual_mean_wage,Mean_wage_RSE,Location
0,0,00-0000,All Occupations,total,67110,2.8%,1000.0,1.0,$14.81,$18.40,"$38,260",1.3%,Eastern_and_Southern_Colorado_nonmetropolitan_...
1,1,11-0000,Management Occupations,major,2250,4.4%,33.498,0.66,$33.21,$39.40,"$81,950",2.7%,Eastern_and_Southern_Colorado_nonmetropolitan_...
2,2,11-1011,Chief Executives,detail,70,10.7%,1.074,0.68,$42.34,$42.09,"$87,550",4.1%,Eastern_and_Southern_Colorado_nonmetropolitan_...
3,3,11-1021,General and Operations Managers,detail,910,8.9%,13.582,0.87,$32.85,$43.51,"$90,510",6.7%,Eastern_and_Southern_Colorado_nonmetropolitan_...
4,4,11-1031,Legislators,detail,90,10.4%,1.404,3.67,(4),(4),"$44,440",3.4%,Eastern_and_Southern_Colorado_nonmetropolitan_...
5,5,11-3011,Administrative Services Managers,detail,60,27.7%,0.866,0.46,$30.71,$36.48,"$75,870",7.0%,Eastern_and_Southern_Colorado_nonmetropolitan_...
6,6,11-3031,Financial Managers,detail,100,12.4%,1.441,0.37,$48.92,$52.15,"$108,480",5.9%,Eastern_and_Southern_Colorado_nonmetropolitan_...
7,7,11-3051,Industrial Production Managers,detail,50,35.2%,0.696,0.58,$45.51,$49.27,"$102,480",5.0%,Eastern_and_Southern_Colorado_nonmetropolitan_...
8,8,11-9021,Construction Managers,detail,70,15.3%,1.106,0.62,$33.16,$35.14,"$73,090",6.9%,Eastern_and_Southern_Colorado_nonmetropolitan_...
9,9,11-9032,"Education Administrators, Elementary and Secon...",detail,210,5.4%,3.093,1.79,(4),(4),"$66,310",3.3%,Eastern_and_Southern_Colorado_nonmetropolitan_...


In [7]:
conn = sqlite3.connect(main_db)

occupations = pd.read_sql_query(
    "SELECT Occupation_title FROM master;", conn)

locations = pd.read_sql_query(
    "SELECT location FROM master;", conn)

occupations.drop_duplicates().sort_values('Occupation_title')
# locations.drop_duplicates().sort_values('Location')

conn.close()

Unnamed: 0,Occupation_title
22,Accountants and Auditors
2630,Actors
4764,Actuaries
5345,Adhesive Bonding Machine Operators and Tenders
4852,"Administrative Law Judges, Adjudicators, and H..."
5,Administrative Services Managers
70,Adult Basic and Secondary Education and Litera...
168,Advertising Sales Agents
1114,Advertising and Promotions Managers
10918,Aerospace Engineering and Operations Technicians


Jackpot! Now let's ask some questions.

I live in Pueblo, Colorado. How much should I expect to make doing [occupation]?

In [8]:
conn = sqlite3.connect(main_db)

pueblo = pd.read_sql_query(
    "SELECT Occupation_title, Location_quotient, Annual_mean_wage, location \
    FROM master WHERE location='Pueblo_CO';", conn)

pueblo.replace(['(8)', '(4)'], np.nan, inplace=True)
pueblo.Annual_mean_wage.replace(['(5)'], 208000, inplace=True)

number_fields = ['Annual_mean_wage', 'Location_quotient']

for nf in number_fields:
    pueblo[nf] = pueblo[nf].replace(r'[\$%,]', '', regex=True).astype(float)

pueblo.set_index('Occupation_title', inplace=True) 

pueblo.dropna(subset=['Annual_mean_wage'], inplace=True)

pueblo.sort_values('Annual_mean_wage', ascending=False)

conn.close()

Occupation_title,Location_quotient,Annual_mean_wage,Location
Surgeons,1.87,263370.0,Pueblo_CO
"Physicians and Surgeons, All Other",0.80,255840.0,Pueblo_CO
"Dentists, General",0.96,186360.0,Pueblo_CO
Family and General Practitioners,3.40,164470.0,Pueblo_CO
Pharmacists,1.37,123390.0,Pueblo_CO
Industrial Production Managers,0.83,123070.0,Pueblo_CO
Financial Managers,0.22,121090.0,Pueblo_CO
Psychiatrists,3.08,120890.0,Pueblo_CO
Personal Financial Advisors,,119990.0,Pueblo_CO
Medical and Health Services Managers,1.26,106250.0,Pueblo_CO


Surgeons are at the top.. No wonder they're usually dicks...


My wife is a pharmacist. How much should she expect to make in [area]?

It's important to consider market demand at this point too. Let's use a simple ratio of salary to location quotient to ID areas with both high wages and high demand. A location quotient above 1 means the occupation makes up a larger share of local employment than the national average, which is a possible indication of market saturation. Let's look for areas with high wages and small (<1) location quotients.

In [9]:
# Pharmacist

conn = sqlite3.connect(main_db)

pharm = pd.read_sql_query(
    "SELECT Occupation_title, Location_quotient, Annual_mean_wage, location \
     FROM master WHERE Occupation_title='Pharmacists';", conn)

pharm.replace(['(8)', '(4)'], np.nan, inplace=True)
pharm.Annual_mean_wage.replace(['(5)'], 250000, inplace=True)

number_fields = ['Annual_mean_wage', 'Location_quotient']

for nf in number_fields:
    pharm[nf] = pharm[nf].replace(r'[\$%,]', '', regex=True).astype(float)

pharm.set_index('Location', inplace=True) 
pharm.drop_duplicates(inplace=True)

pharm['Wage_Loc'] = \
    np.round(pharm.Annual_mean_wage / pharm.Location_quotient, 0)

pharm.dropna(subset=['Annual_mean_wage'], inplace=True)
pharm.sort_values('Annual_mean_wage', ascending=False)

conn.close()

Location,Occupation_title,Location_quotient,Annual_mean_wage,Wage_Loc
Victoria_TX,Pharmacists,1.02,165230.0,161990.0
Southeast_Alaska_nonmetropolitan_area,Pharmacists,0.56,159200.0,284286.0
Santa_Cruz_Watsonville_CA,Pharmacists,0.73,152170.0,208452.0
Northwest_Alabama_nonmetropolitan_area,Pharmacists,1.94,151810.0,78253.0
Gadsden_AL,Pharmacists,1.20,149000.0,124167.0
Chico_CA,Pharmacists,1.14,148230.0,130026.0
El_Centro_CA,Pharmacists,0.62,148010.0,238726.0
Santa_Rosa_CA,Pharmacists,1.15,146210.0,127139.0
Odessa_TX,Pharmacists,0.65,145700.0,224154.0
Laredo_TX,Pharmacists,0.39,145270.0,372487.0
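The table above is sorted by wage alone. As a worked example of the wage / location quotient idea, the sketch below takes the ten rows shown, keeps only the areas with a quotient below 1, and ranks them by the ratio. It is plain Python 3 with no pandas, and covers only the rows visible in the output, not the full result set.

```python
# (location, location_quotient, annual_mean_wage) copied from the output above
rows = [
    ('Victoria_TX', 1.02, 165230.0),
    ('Southeast_Alaska_nonmetropolitan_area', 0.56, 159200.0),
    ('Santa_Cruz_Watsonville_CA', 0.73, 152170.0),
    ('Northwest_Alabama_nonmetropolitan_area', 1.94, 151810.0),
    ('Gadsden_AL', 1.20, 149000.0),
    ('Chico_CA', 1.14, 148230.0),
    ('El_Centro_CA', 0.62, 148010.0),
    ('Santa_Rosa_CA', 1.15, 146210.0),
    ('Odessa_TX', 0.65, 145700.0),
    ('Laredo_TX', 0.39, 145270.0),
]

# Keep areas where pharmacists are under-represented (quotient < 1) ...
candidates = [(loc, lq, wage, round(wage / lq)) for loc, lq, wage in rows if lq < 1]
# ... and rank them by wage per unit of location quotient, highest first
candidates.sort(key=lambda r: r[3], reverse=True)

for loc, lq, wage, wage_loc in candidates:
    print('{:<40} {:>5.2f} {:>9.0f} {:>8d}'.format(loc, lq, wage, wage_loc))
```

Laredo comes out on top: its wage of 145,270 divided by a quotient of 0.39 gives about 372,487, matching the `Wage_Loc` column. Victoria, the highest-paying area, drops out because its quotient is just over 1.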


Here's the scoop on Archaeology as well. Clearly, my earning potential is a LOT less than my wife's...... :/

In [10]:
# Anthropologists and Archeologists

conn = sqlite3.connect(main_db)

arch = pd.read_sql_query(
    "SELECT Occupation_title, Location_quotient, Annual_mean_wage, location \
     FROM master WHERE Occupation_title='Anthropologists and Archeologists';",
     conn)

arch.replace(['(8)', '(4)'], np.nan, inplace=True)
arch.Annual_mean_wage.replace(['(5)'], 250000, inplace=True)

number_fields = ['Annual_mean_wage', 'Location_quotient']

for nf in number_fields:
    arch[nf] = arch[nf].replace(r'[\$%,]', '', regex=True).astype(float)

arch.set_index('Location', inplace=True) 
arch.drop_duplicates(inplace=True)

arch['Wage_Loc'] = \
    np.round(arch.Annual_mean_wage / arch.Location_quotient, 0)

arch.dropna(subset=['Annual_mean_wage'], inplace=True)
arch.sort_values('Annual_mean_wage', ascending=False)

conn.close()

Location,Occupation_title,Location_quotient,Annual_mean_wage,Wage_Loc
Houston_The_Woodlands_Sugar_Land_TX,Anthropologists and Archeologists,0.73,89060.0,122000.0
Montgomery_County_Bucks_County_Chester_County_PA_Metropolitan_Division,Anthropologists and Archeologists,,86470.0,
Washington_Arlington_Alexandria_DC_VA_MD_WV_Metropolitan_Division,Anthropologists and Archeologists,1.42,86350.0,60810.0
Philadelphia_Camden_Wilmington_PA_NJ_DE_MD,Anthropologists and Archeologists,,84060.0,
Anchorage_AK,Anthropologists and Archeologists,6.87,83170.0,12106.0
Washington_Arlington_Alexandria_DC_VA_MD_WV,Anthropologists and Archeologists,1.34,81500.0,60821.0
Tucson_AZ,Anthropologists and Archeologists,3.47,81040.0,23354.0
Sacramento_Roseville_Arden_Arcade_CA,Anthropologists and Archeologists,2.63,78000.0,29658.0
Eugene_OR,Anthropologists and Archeologists,,77170.0,
New_York_Newark_Jersey_City_NY_NJ_PA,Anthropologists and Archeologists,0.35,75990.0,217114.0


Returning to my aunt's original question about comparing salaries, we can now provide the relevant data. My aunt works in sales management. Where should she move???

In [11]:
# Sales Managers

conn = sqlite3.connect(main_db)

sales = pd.read_sql_query(
    "SELECT Occupation_title, Location_quotient, Annual_mean_wage, location \
     FROM master WHERE Occupation_title='Sales Managers';", conn)

sales.replace(['(8)', '(4)'], np.nan, inplace=True)
sales.Annual_mean_wage.replace(['(5)'], 250000, inplace=True)

number_fields = ['Annual_mean_wage', 'Location_quotient']

for nf in number_fields:
    sales[nf] = sales[nf].replace(r'[\$%,]', '', regex=True).astype(float)

sales.set_index('Location', inplace=True) 
sales.drop_duplicates(inplace=True)

sales['Wage_Loc'] = \
    np.round(sales.Annual_mean_wage / sales.Location_quotient, 0)

sales.dropna(subset=['Annual_mean_wage'], inplace=True)
sales.sort_values('Annual_mean_wage', ascending=False)

conn.close()

Location,Occupation_title,Location_quotient,Annual_mean_wage,Wage_Loc
New_York_Jersey_City_White_Plains_NY_NJ_Metropolitan_Division,Sales Managers,1.15,197800.0,172000.0
New_York_Newark_Jersey_City_NY_NJ_PA,Sales Managers,1.10,191360.0,173964.0
Nassau_County_Suffolk_County_NY_Metropolitan_Division,Sales Managers,0.59,183040.0,310237.0
Bridgeport_Stamford_Norwalk_CT,Sales Managers,2.64,180470.0,68360.0
Wilmington_DE_MD_NJ_Metropolitan_Division,Sales Managers,0.83,177140.0,213422.0
Fort_Collins_CO,Sales Managers,0.40,174150.0,435375.0
Philadelphia_PA_Metropolitan_Division,Sales Managers,0.55,173010.0,314564.0
Vineland_Bridgeton_NJ,Sales Managers,0.48,169310.0,352729.0
Newark_NJ_PA_Metropolitan_Division,Sales Managers,1.46,168070.0,115116.0
Framingham_MA_NECTA_Division,Sales Managers,2.01,167600.0,83383.0


My aunt lives in Napa. Let's see what employment in Napa looks like, starting from the top:

In [12]:
# Napa

conn = sqlite3.connect(main_db)

napa = pd.read_sql_query(
    "SELECT * FROM master WHERE location='Napa_CA';", conn)

napa.replace(['(8)', '(4)'], np.nan, inplace=True)
napa.Annual_mean_wage.replace(['(5)'], 208000, inplace=True)

number_fields = ['Annual_mean_wage', 'Location_quotient']

for nf in number_fields:
    napa[nf] = napa[nf].replace(r'[\$%,]', '', regex=True).astype(float)

napa.set_index('Occupation_title', inplace=True) 
napa.drop_duplicates(inplace=True)

napa.dropna(subset=['Annual_mean_wage'], inplace=True)

napa.sort_values('Annual_mean_wage', ascending=False)

conn.close()

Occupation_title,index,Occupation_code,Level,Employment,Employment_RSE,Employment_per_1k_jobs,Location_quotient,Median_hourly_wage,Mean_hourly_wage,Annual_mean_wage,Mean_wage_RSE,Location
"Physicians and Surgeons, All Other",95,29-1069,detail,120,27.3%,1.625,0.67,(5),$123.36,256600.0,13.3%,Napa_CA
Psychiatrists,94,29-1066,detail,210,33.6%,2.972,16.81,(5),(5),208000.0,11.3%,Napa_CA
Family and General Practitioners,92,29-1062,detail,70,27.5%,1.047,1.20,(5),$93.78,195070.0,10.1%,Napa_CA
"Internists, General",93,29-1063,detail,40,37.4%,0.504,1.56,$85.86,$91.56,190450.0,23.8%,Napa_CA
Chief Executives,2,11-1011,detail,110,10.1%,1.484,0.93,$86.35,$90.39,188020.0,6.9%,Napa_CA
"Dentists, General",89,29-1021,detail,50,39.1%,0.728,0.97,$74.93,$84.82,176420.0,20.9%,Napa_CA
Personal Financial Advisors,36,13-2052,detail,60,22.6%,0.803,0.56,$84.99,$82.07,170700.0,21.6%,Napa_CA
Lawyers,70,23-1011,detail,150,18.2%,2.100,0.48,$67.72,$71.65,149040.0,3.3%,Napa_CA
Sales Managers,5,11-2022,detail,400,12.8%,5.651,2.17,$55.49,$70.95,147570.0,10.4%,Napa_CA
Financial Managers,8,11-3031,detail,270,10.6%,3.876,1.00,$67.42,$70.17,145950.0,4.8%,Napa_CA


Aaahhhh.. I should have been a surgeon. I'd be rich and have a good excuse for being a dick. If only everything about it didn't give me nightmares.. alas..

Let's save the Napa and Sales tables to send to my aunt and go get a beer. Success!

In [13]:
sales.to_csv('sales_management_salaries.csv', header=True, index=True)
napa.to_csv('napa_salaries.csv', header=True, index=True)

# Footnotes
(1) Estimates for detailed occupations do not sum to the totals because the totals include occupations not shown separately. Estimates do not include self-employed workers.

(2) Annual wages have been calculated by multiplying the hourly mean wage by a "year-round, full-time" hours figure of 2,080 hours; for those occupations where there is not an hourly wage published, the annual wage has been directly calculated from the reported survey data.

(3) The relative standard error (RSE) is a measure of the reliability of a survey statistic. The smaller the relative standard error, the more precise the estimate.

(4) Wages for some occupations that do not generally work year-round, full time, are reported either as hourly wages or annual salaries depending on how they are typically paid.

(5) This wage is equal to or greater than $100.00 per hour or $208,000 per year.

(8) Estimates not released.

(9) The location quotient is the ratio of the area concentration of occupational employment to the national average concentration. A location quotient greater than one indicates the occupation has a higher share of employment than average, and a location quotient less than one indicates the occupation is less prevalent in the area than average.
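The footnote markers above show up as literal strings in the scraped wage columns, so they have to be handled before the text can be treated as numbers. Here is a minimal sketch of that conversion for annual wages, in plain Python 3. `parse_wage` is a hypothetical helper, and using the footnote (5) floor of 208,000 as the stand-in value is an assumption; any top-coded figure is only a lower bound.

```python
def parse_wage(text, top_code=208000.0):
    """Convert a scraped annual wage string to a float, or None when no figure is given."""
    text = text.strip()
    if text in ('(4)', '(8)'):
        return None          # footnote (4) or (8): no usable figure in this column
    if text == '(5)':
        return top_code      # footnote (5): at least this much, exact value unknown
    return float(text.replace('$', '').replace(',', ''))


for raw in ['$38,260', '(4)', '(5)', '(8)']:
    print(raw, '->', parse_wage(raw))
```

For hourly wage columns the matching floor from footnote (5) would be 100.00, which can be passed as `top_code`.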