# _Physician Compare National: Explore #7_

This notebook continues my analysis of the Physician Compare National Downloadable File gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs), such as demographic information and Medicare quality program participation. The dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal_

Finish cleaning up the rest of the columns. If I get the opportunity to start exploring, awesome, but it is not a big deal if I don't get to it. 

In [1]:
from datetime import datetime

# current date and time
now = datetime.now()

# timestamp to signify the beginning of work
print("Work started: ", now)

Work started:  2019-10-03 09:06:22.915319


In [2]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-2.csv', low_memory=False)

# inspect the first five rows
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,24440 STONE SPRINGS BLVD,N,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,1401 JOHNSTON WILLIS DR,N,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,411 W RANDOLPH RD,N,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


Before we dive into the `zip` column, let's consult the documentation on this column to see if there might be any caveats to it. 

- `zip`: the zip code of the individual's group practice (9 digits when available)

ZIP codes are traditionally five digits that identify an area of the United States. In 1983, though, an extended ZIP+4 code was introduced; the four additional digits designate a more specific location. For our analysis, I do not think we'll need the +4 digits. We'll still have the five-digit zip code, which provides us with the information we need. Additionally, if we ever want to map zip codes to FIPS codes (some geography-focused libraries work with FIPS or ZIP codes), the +4 digits might make that conversion slightly more complicated.
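As a minimal sketch of what dropping the +4 digits could look like, the snippet below splits a few nine-character values (copied from the `head()` output above) into a five-digit part and a +4 part. It assumes every value is a well-formed nine-character string; the names `zips`, `zip5`, and `plus4` are made up for illustration.

```python
import pandas as pd

# Sample values copied from the first rows of the data; assumed to be
# well-formed nine-character ZIP+4 strings with no separator.
zips = pd.Series(['201903219', '222053610', '201662247'])

# The first five characters are the base zip code, the last four the +4 extension.
zip5 = zips.str[:5]
plus4 = zips.str[5:]

print(pd.DataFrame({'zip': zips, 'zip5': zip5, 'plus4': plus4}))
```

Plain slicing like this is only safe once all values have a consistent length, which is worth checking before applying it to the full column.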

In [3]:
# what data type is the zip code column?
data['zip'].dtype

dtype('O')
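A dtype of `'O'` (object) is what pandas uses for columns that hold Python strings. As a general caveat when loading zip codes, and not a claim about what happened to this particular file, the sketch below uses a tiny made-up in-memory CSV to show how the `dtype` argument of `pd.read_csv` affects values with leading zeros.

```python
import io
import pandas as pd

# Tiny made-up CSV, for illustration only
csv_text = "cty,zip\nOAKLAND,04963\nRESTON,20190\n"

# Default type inference parses the column as integers, so the leading zero is lost
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the values exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})

print(inferred['zip'].tolist())
print(as_text['zip'].tolist())
```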

In [13]:
# what are the unique lengths of the zip codes?
print('Zip code lengths take the following values: {}'.format(sorted(data['zip'].str.len().unique())))

Zip code lengths take the following values: [3, 4, 5, 7, 8, 9]


It looks like we have an issue: the zip codes take a range of different lengths from three characters to nine characters. Let us examine this a little bit further to see if we might be able to figure out what is going on. 

In [14]:
# what are the value counts for each length of the zip codes?
data['zip'].str.len().value_counts()

9    2059140
8     132917
5      13060
7       3039
3       1710
4        924
Name: zip, dtype: int64

In [15]:
# select the observations where the zip code is 8 characters long
data[data['zip'].str.len() == 8].head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
46,1003001272,8820183544,I20070926000754,DEANNE E OCHOA DURRELL,F,Not Listed,OTHER,2004,CLINICAL SOCIAL WORKER,,,,,,MAINEGENERAL MEDICAL CENTER,1254246000.0,371,9 PLEASANT ST,N,OAKLAND,ME,49635074,2074652000.0,200039.0,MAINE GENERAL MEDICAL CENTER,,,,,,,,,Y
47,1003001272,8820183544,I20070926000754,DEANNE E OCHOA DURRELL,F,Not Listed,OTHER,2004,CLINICAL SOCIAL WORKER,,,,,,"HEALTH AFFILIATES MAINE, LLC",1951594000.0,134,306 RODMAN RD,N,AUBURN,ME,42103830,2073333000.0,200039.0,MAINE GENERAL MEDICAL CENTER,,,,,,,,,Y
64,1003001587,9234213760,I20160121002007,MARY C TRAN,F,Not Listed,OTHER,2007,PHYSICIAN ASSISTANT,,,,,,"MHS PRIMARY CARE, INC",4082507000.0,116,896 WASHINGTON ST MIDDLESEX HOSPITAL URGENT CARE,N,MIDDLETOWN,CT,64572912,8607884000.0,70011.0,CHARLOTTE HUNGERFORD HOSPITAL,,,,,,,,,Y
65,1003001587,9234213760,I20160121002007,MARY C TRAN,F,Not Listed,OTHER,2007,PHYSICIAN ASSISTANT,,,,,,CHARLOTTE HUNGERFORD HOSPITAL,4486558000.0,114,540 LITCHFIELD ST,N,TORRINGTON,CT,67906679,,70011.0,CHARLOTTE HUNGERFORD HOSPITAL,,,,,,,,,Y
69,1003001678,1456584564,I20170831000516,REBECCA D EIRICH,F,Not Listed,OTHER,2006,NURSE PRACTITIONER,,,,,,REGIONAL WOMENS HEALTH GROUP LLC,2961316000.0,253,239 HURFFVILLE CROSSKEYS RD SUITE 250,N,SEWELL,NJ,80804006,8562628000.0,,,,,,,,,,,Y
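One way to look for a pattern behind the different lengths is to tabulate zip code length against state. The sketch below does this on a small sample copied from the rows shown so far; `sample` and `length_by_state` are illustrative names, and on the real data the same call would use `data['st']` and `data['zip']`.

```python
import pandas as pd

# Small sample copied from the outputs above (state and raw zip value)
sample = pd.DataFrame({
    'st': ['VA', 'VA', 'ME', 'ME', 'CT', 'NJ'],
    'zip': ['201903219', '222053610', '49635074', '42103830', '64572912', '80804006'],
})

# Rows are states, columns are zip code lengths, cells are observation counts
length_by_state = pd.crosstab(sample['st'], sample['zip'].str.len())
print(length_by_state)
```

If the unusual lengths cluster in particular states, that would narrow down where the problem comes from.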


OK, so this is where things could get a little complicated. I don't have a hypothesis yet as to why there are so many different lengths for the zip codes, especially the lengths of 3 and 4. It seems plausible that these are simply typos; however, we need to figure out how to relabel these zip codes as accurately as possible. Luckily, we have the `cty` and `st` for each observation, and with this information we might be able to address the problem with the help of the [`uszipcode`](https://uszipcode.readthedocs.io/index.html) library, which describes itself as "_the most powerful and easy to use programmable zipcode database in Python_." So let's give it a shot and see how it might help us.

In [17]:
# if you do not have the library installed, run the command below to install uszipcode
# !pip install uszipcode

In [18]:
# load zip code library
from uszipcode import SearchEngine

# create search object for zip code look-up
search = SearchEngine(simple_zipcode=True)

Start downloading data for simple zipcode database, total size 9MB ...
  1 MB finished ...
  2 MB finished ...
  3 MB finished ...
  4 MB finished ...
  5 MB finished ...
  6 MB finished ...
  7 MB finished ...
  8 MB finished ...
  9 MB finished ...
  10 MB finished ...
  Complete!


In [29]:
# use the city and state from the first observation above - Oakland, ME - to return information on zip code
oakland_maine = search.by_city_and_state(city='Oakland', state='ME'); oakland_maine

[SimpleZipcode(zipcode='04963', zipcode_type='Standard', major_city='Oakland', post_office_city='Oakland, ME', common_city_list=['Oakland', 'Rome'], county='Kennebec County', state='ME', lat=44.6, lng=-69.8, timezone='Eastern', radius_in_miles=10.0, area_code_list=['207'], population=7238, population_density=142.0, land_area_in_sqmi=51.0, water_area_in_sqmi=8.81, housing_units=4050, occupied_housing_units=2977, median_home_value=159500, median_household_income=56994, bounds_west=-69.947541, bounds_east=-69.665525, bounds_north=44.621506, bounds_south=44.508191)]

In [36]:
# the search returns a list because a city can map to several zip codes, so we extract the first
# SimpleZipcode object from the list and convert it to a dictionary
oakland_maine = oakland_maine[0].to_dict(); oakland_maine

{'zipcode': '04963',
 'zipcode_type': 'Standard',
 'major_city': 'Oakland',
 'post_office_city': 'Oakland, ME',
 'common_city_list': ['Oakland', 'Rome'],
 'county': 'Kennebec County',
 'state': 'ME',
 'lat': 44.6,
 'lng': -69.8,
 'timezone': 'Eastern',
 'radius_in_miles': 10.0,
 'area_code_list': ['207'],
 'population': 7238,
 'population_density': 142.0,
 'land_area_in_sqmi': 51.0,
 'water_area_in_sqmi': 8.81,
 'housing_units': 4050,
 'occupied_housing_units': 2977,
 'median_home_value': 159500,
 'median_household_income': 56994,
 'bounds_west': -69.947541,
 'bounds_east': -69.665525,
 'bounds_north': 44.621506,
 'bounds_south': 44.508191}

In [39]:
# extract the zip code from the dictionary we created above
print('The zip code for {} is {}.'.format(oakland_maine['post_office_city'], oakland_maine['zipcode']))

The zip code for Oakland, ME is 04963.


In [45]:
# to test it on one more observation let's use the next city listed - Middletown, CT
middletown_ct = search.by_city_and_state('Middletown', 'CT')[0]; middletown_ct

SimpleZipcode(zipcode='06457', zipcode_type='Standard', major_city='Middletown', post_office_city='Middletown, CT', common_city_list=['Middletown'], county='Middlesex County', state='CT', lat=41.5, lng=-72.7, timezone='Eastern', radius_in_miles=7.0, area_code_list=['860'], population=47648, population_density=1162.0, land_area_in_sqmi=41.02, water_area_in_sqmi=1.35, housing_units=21223, occupied_housing_units=19863, median_home_value=234900, median_household_income=59994, bounds_west=-72.752941, bounds_east=-72.550945, bounds_north=41.60431, bounds_south=41.494838)

In [46]:
# convert middletown_ct to dictionary
middletown_ct = middletown_ct.to_dict(); middletown_ct

{'zipcode': '06457',
 'zipcode_type': 'Standard',
 'major_city': 'Middletown',
 'post_office_city': 'Middletown, CT',
 'common_city_list': ['Middletown'],
 'county': 'Middlesex County',
 'state': 'CT',
 'lat': 41.5,
 'lng': -72.7,
 'timezone': 'Eastern',
 'radius_in_miles': 7.0,
 'area_code_list': ['860'],
 'population': 47648,
 'population_density': 1162.0,
 'land_area_in_sqmi': 41.02,
 'water_area_in_sqmi': 1.35,
 'housing_units': 21223,
 'occupied_housing_units': 19863,
 'median_home_value': 234900,
 'median_household_income': 59994,
 'bounds_west': -72.752941,
 'bounds_east': -72.550945,
 'bounds_north': 41.60431,
 'bounds_south': 41.494838}

In [48]:
# what is the zip code for Middletown, CT?
print('The zip code for {} is {}.'.format(middletown_ct['post_office_city'], middletown_ct['zipcode']))

The zip code for Middletown, CT is 06457.
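Both look-ups above follow the same pattern, so it could be wrapped in a small helper. This is only a sketch: `first_zip_for_city` is a hypothetical name, it expects the `search` object created earlier to be passed in, and it assumes that taking the first result is good enough, which may not hold for cities that span several zip codes.

```python
def first_zip_for_city(search, city, state):
    """Return the first zip code found for a city/state pair, or None if nothing matches.

    `search` is expected to behave like the uszipcode SearchEngine used above:
    its by_city_and_state method returns a list of objects with a `zipcode` attribute.
    """
    results = search.by_city_and_state(city=city, state=state)
    if not results:
        # guard against an empty result list
        return None
    return results[0].zipcode
```

With the `search` object from earlier, `first_zip_for_city(search, 'Oakland', 'ME')` would mirror the Oakland look-up above.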


# _To-Do's for next exploration_

1. Figure out how to use `uszipcode` library to clean up zip code data
2. Reformat zip code data to five digits (i.e., no +4)
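One idea to test next time, and it is only a hypothesis: the 3-, 4-, 7- and 8-character values might be 5- or 9-digit codes that lost their leading zeros. The two look-ups above are consistent with this, since the recorded value `49635074` for Oakland, ME starts with `04963` once padded to nine digits, and `64572912` for Middletown, CT starts with `06457`. The sketch below pads and truncates under that assumption; `to_zip5` is a hypothetical helper, not part of any library.

```python
import pandas as pd

def to_zip5(raw):
    """Left-pad with zeros to 5 or 9 digits, then keep the first five.

    Assumes values of up to 5 characters are five-digit codes and longer
    values are ZIP+4 codes, both possibly missing leading zeros.
    """
    raw = str(raw).strip()
    width = 5 if len(raw) <= 5 else 9
    return raw.zfill(width)[:5]

# Sample values copied from the outputs earlier in the notebook
sample = pd.Series(['201903219', '49635074', '64572912', '80804006'])
print(sample.map(to_zip5).tolist())
# prints ['20190', '04963', '06457', '08080']
```

Comparing the padded result against `uszipcode` look-ups by `cty` and `st` would be one way to check whether the assumption holds across the full dataset.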

In [50]:
from datetime import datetime

# current date and time
end = datetime.now()

# timestamps to signify the beginning and end of work, plus the total time worked
print("Work started: ", now, '\nWork ended: ', end, '\nTime worked:', (end-now))

Work started:  2019-10-03 09:06:22.915319 
Work ended:  2019-10-03 10:07:55.997210 
Time worked: 1:01:33.081891
