# _Physician Compare National: Explore #8_

This notebook is a continuation from of my analysis on the following data gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs) such as demographic information and Medicare quality program participation. This dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal(s)_

1. Figure out how to either use `uszipcode` library or another technique to clean up zip code data
2. Reformat zip code data to five digit (i.e. no +4 digits)

In [1]:
from datetime import datetime

# current date and time
now = datetime.now()

# timestamp to signify the beginning of work
print("Work started: ", now)

Work started:  2019-10-04 10:01:28.427318


In [1]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-2.csv', low_memory=False);

# inspect the first five rows
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,24440 STONE SPRINGS BLVD,N,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,1401 JOHNSTON WILLIS DR,N,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,411 W RANDOLPH RD,N,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


As a reminders below are the unique lengths of the zip codes and tha value counts for each length of the zip codes.

In [2]:
# what are the unique lengths of the zip codes?
print('Length of zip codes take the following values: {}'.format(sorted(data['zip'].str.len().unique())))

Length of zip codes take the following values: [3, 4, 5, 7, 8, 9]


In [3]:
# what are the value counts for each length of the zip codes?
data['zip'].str.len().value_counts()

9    2059140
8     132917
5      13060
7       3039
3       1710
4        924
Name: zip, dtype: int64

Additionally, we found that we could use city and state to to look up associated zip codes for that particular town. So lets see if we can put together some type of function that will extract the values of `cty` and `st` from `data`, place those values into `uszipcode`'s `SearchEngine` and extract the zipcode from the resulting converted dictionary.

In [34]:
# create subset of data where zip code is length 4
length_4 = data[data['zip'].str.len() == 4]; length_4.head(6)

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
102,1003003088,4981791837,I20071025000015,AVANI VORA,F,Not Listed,OTHER,2010,PHYSICAL THERAPY,,,,,,JERSEY PHYSICAL THERAPY ASSOCIATES LLC,547240300.0,6,3228 ROUTE 27,N,KENDALL PARK,NJ,8824,7322970000.0,,,,,,,,,,,Y
339,1003011065,4082888722,I20111115000827,LAURA M LEI,F,Not Listed,UMDNJ-NEW JERSEY MEDICAL SCHOOL,2007,ANESTHESIOLOGY,,,,,,ANESTHESIA CONSULTANTS OF NEW JERSEY LLC,3375449000.0,65,SOMERSET MED CTR 110 REHILL AVENUE,N,SOMERVILLE,NJ,8876,9086852000.0,310048.0,ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL - SOME...,,,,,,,,,Y
1183,1003030651,9739230780,I20161107000694,SALVA BILAL,F,Not Listed,OTHER,1996,FAMILY MEDICINE,,,,,,ELLIOT PROFESSIONAL SERVICES,6103728000.0,227,ONE ELLIOT WAY,N,MANCHESTER,NH,3103,6036634000.0,300012.0,ELLIOT HOSPITAL,300034.0,CATHOLIC MEDICAL CENTER,,,,,,,Y
1242,1003033390,1759474927,I20070904000383,BRIAN C CAMBI,M,Not Listed,STATE UNIVERSITY OF NEW YORK AT BUFFALO SCHOOL...,1999,INTERVENTIONAL CARDIOLOGY,CARDIOVASCULAR DISEASE (CARDIOLOGY),,,,CARDIOVASCULAR DISEASE (CARDIOLOGY),NORTHEAST MEDICAL GROUP INC,1254234000.0,1042,196 PKWY S SUITE 103,N,WATERFORD,CT,6385,8604434000.0,70007.0,LAWRENCE & MEMORIAL HOSPITAL,70022.0,YALE-NEW HAVEN HOSPITAL,410013.0,WESTERLY HOSPITAL,,,,,Y
1462,1003041427,6103953765,I20100430000016,LAURIE A. FORTY,F,Not Listed,OTHER,2007,NURSE PRACTITIONER,,,,,,"COMMUNITY HEALTH CARE, INC.",9335040000.0,42,251 BROAD ST,N,BRIDGETON,NJ,8302,8564531000.0,310032.0,INSPIRA MEDICAL CENTER VINELAND,,,,,,,,,Y
1578,1003043209,648424283,I20170517002241,MARJORIE E AFFEL,F,Not Listed,OTHER,2009,FAMILY MEDICINE,,,,,,"COMMUNITY HEALTH CARE, INC.",9335040000.0,42,251 BROAD ST,N,BRIDGETON,NJ,8302,8564531000.0,310081.0,INSPIRA MEDICAL CENTER WOODBURY,,,,,,,,,Y


In [5]:
# iterate through the rows of length_4 and print out the values for cty and st
for index, row in length_4.head().iterrows():
    print(row['cty'].capitalize(), row['st'])

Kendall park NJ
Somerville NJ
Manchester NH
Waterford CT
Bridgeton NJ


In [6]:
# load zip code library
from uszipcode import SearchEngine

# create search object for zip code look-up
search = SearchEngine(simple_zipcode=True)

In [13]:
# use the city and state from the first observation above - Kendall Park, NJ - to return information on zip code
search.by_city_and_state(city='Kendall park', state='NJ')[0].to_dict()

{'zipcode': '08824',
 'zipcode_type': 'Standard',
 'major_city': 'Kendall Park',
 'post_office_city': 'Kendall Park, NJ',
 'common_city_list': ['Kendall Park'],
 'county': 'Middlesex County',
 'state': 'NJ',
 'lat': 40.42,
 'lng': -74.55,
 'timezone': 'Eastern',
 'radius_in_miles': 2.0,
 'area_code_list': ['732'],
 'population': 12115,
 'population_density': 3124.0,
 'land_area_in_sqmi': 3.88,
 'water_area_in_sqmi': 0.0,
 'housing_units': 4002,
 'occupied_housing_units': 3935,
 'median_home_value': 396300,
 'median_household_income': 117863,
 'bounds_west': -74.585998,
 'bounds_east': -74.51603,
 'bounds_north': 40.441383,
 'bounds_south': 40.397933}

What we did above was to test a simple function that could loop through the first few rows of our `length_4` data subset, and extract the values for city and state. Additionally we found that by simply inputting a value into the city and state parameters of `uszipcode`'s `by_city_and_state` function, we could get a whole host of information of that particular location. Now the next step will be to combine the two together into a possible function that will more quickly clean up the `zip` column.

In [35]:
test_dict = {}
for index, row in length_4.head().iterrows():
    # extract city and state name
    cty = row['cty'].capitalize()
    state = row['st']
    # create dictionary to store city info
    city_zip = search.by_city_and_state(city=str(cty), state=str(state))[0].to_dict()
    test_dict[city_zip['post_office_city']] = city_zip['zipcode']

In [36]:
test_dict

{'Kendall Park, NJ': '08824',
 'Somerville, NJ': '08876',
 'Manchester, NH': '03101',
 'Waterford, CT': '06385',
 'Bridgeton, NJ': '08302'}

COOL! Looks like we have the foundation for function that can help us clean up the `zip` column. But first...let's see how we should clean up the values in the `cty` and `st` column to make them look more like the values above (i.e. not all caps). 

In [44]:
for index, row in length_4.head(10).iterrows():
    # extract city and state name
    cty = row['cty'].title()
    state = row['st']
    print(str(cty + ', ' + state).strip())

Kendall Park, NJ
Somerville, NJ
Manchester, NH
Waterford, CT
Bridgeton, NJ
Bridgeton, NJ
Mahwah, NJ
Mahwah, NJ
Mahwah, NJ
Concord, MA


After a relatively quick search, we found that we could accomplish the string cleaning with [`str.Title()`](https://www.geeksforgeeks.org/title-in-python/), which returns a string where the first letter in each word is uppercase and all remaining letters are lowercase.

Now the next step is to define this function formally and then apply it to our test data set, which will be `length_4`.

In [117]:
city_list = []
# loop through each row, exctracting city name and state
for index, row in length_4.iterrows():
    cty = row['cty'].title()
    state = row['st']
    # combine the two for full location name
    full_name = str(cty + ', ' + state).strip()
    # append to list
    city_list.append(full_name)

In [118]:
len(city_list)

924

In [119]:
# function to get unique values 
def unique(list1):
    # initialize a null list
    unique_list = []
    
    # traverse for all elements
    for x in list1:
        # check if exists in unique_list or not
        if x not in unique_list:
            unique_list.append(x)
    
    return unique_list

In [120]:
unique_cities = unique(city_list); len(unique_cities)

248

In [124]:
dictionary = {}

for city in unique_cities:
    cty, st = city.split(',')
    # strip any possible space from split
    cty = cty.strip()
    st = st.strip()
    # get zip code for city, state
    print(search.by_city_and_state(city=str(cty), state=str(st))[0].to_dict()['zipcode'])

TypeError: 'int' object is not iterable

In [102]:
city_dictionary = {}

for city in unique_cities:
    cty, st = city.split(',')
    # strip any possible space from split
    cty = cty.strip()
    st = st.strip()
    # get zip code for city, state
    city_zip = search.by_city_and_state(city=str(cty), state=str(st))[0].to_dict()
    dictionary[city] = city_zip['zipcode']

city_dictionary

IndexError: list index out of range

In [106]:
len(length_4)

924

In [116]:
dictionary = {}

for index, row in length_4.head(15).iterrows():
    cty = row['cty'].title()
    state = row['st']
    # combine the two for full location name
    full_name = str(cty + ', ' + state).strip()
    # get zip code
    cty_info = (search.by_city_and_state(city=cty, state=st)[0].to_dict()['zipcode'])
    print(cty_info)

IndexError: list index out of range

In [75]:
# define function that will extract rows city and state, search for its information with uszipcodes and then place the associated zip code
# into a dictionary 
d = {}

# loop through each row, exctracting city name and state
for index, row in length_4.iterrows():
    cty = row['cty'].title()
    state = row['st']
    # combine the two for full location name
    full_name = str(cty + ', ' + state).strip()
    if full_name not in d.keys():
        d.update({str(full_name): search.by_city_and_state(city=cty, state=state)[0].to_dict()['zipcode']})

'''
    # input above values into uszipcodes search engine
    city_info = search.by_city_and_state(city=str(cty), state=str(state))[0].to_dict()
    # input zipcode with city name as key
    if full_name not in d.keys():
        d[city_info['post_office_city']] = city_info['zipcode']
'''

IndexError: list index out of range

In [62]:
# loop through dictionary and try to append corresponding zip code to key
for key in d.keys():
    # store key (i.e. city name)
    str(key).split(','))
    #city_zip = search.by_city_and_state(city=str(cty), state=str(state))[0].to_dict()

['Kendall Park', ' NJ']
['Somerville', ' NJ']
['Manchester', ' NH']
['Waterford', ' CT']
['Bridgeton', ' NJ']
['Mahwah', ' NJ']
['Concord', ' MA']
['Oak Bluffs', ' MA']
['Cambridge', ' MA']
['Hyannis', ' MA']
['Marston Mills', ' MA']
['Mashpee', ' MA']
['Colts Neck', ' NJ']
['Cherry Hill', ' NJ']
['Camden', ' NJ']
['Bridgewater', ' NJ']
['Essex', ' CT']
['Milldale', ' CT']
['Scarborough', ' ME']
['Milford', ' MA']
['Secaucus', ' NJ']
['Taunton', ' MA']
['Pompton Plains', ' NJ']
['North Easton', ' MA']
['Wolfeboro', ' NH']
['Hackettstown', ' NJ']
['Elizabeth', ' PA']
['Waterville', ' ME']
['Liberty Corner', ' NJ']
['Morristown', ' NJ']
['Atlantic City', ' NJ']
['Salem', ' NJ']
['Barrington', ' NH']
['Clinton', ' NJ']
['Mountain Lks', ' NJ']
['Hiram', ' ME']
['Marlton', ' NJ']
['New Brunswick', ' NJ']
['Concord', ' NH']
['Westmont', ' NJ']
['Ridgewood', ' NJ']
['Presque Isle', ' ME']
['Arlington', ' VT']
['Portland', ' ME']
['Riverdale', ' NJ']
['Wall', ' NJ']
['Framingham', ' MA']
['Wor

In [125]:
!pip install zipcodes

Collecting zipcodes
[?25l  Downloading https://files.pythonhosted.org/packages/81/d1/b52c2d5bd93c8532f78cb2df688baa16cc121351ebd947274eabd944531d/zipcodes-1.0.5-py2.py3-none-any.whl (571kB)
[K     |████████████████████████████████| 573kB 571kB/s eta 0:00:01
[?25hInstalling collected packages: zipcodes
Successfully installed zipcodes-1.0.5


In [134]:
import zipcodes
from pprint import pprint

for index, row in length_4.head(15).iterrows():
    # extract city and state name
    cty = row['cty']
    st = row['st']
    zip_code = zipcodes.filter_by(city=cty, state=st)[0]['zip_code']
    print(zip_code)

08824
08876
03101
06385
08302
08302
07430
07430
07430
01742
02557
02138
02601


IndexError: list index out of range

In [138]:
length_4.iloc[12, 3].strip()

'MARY  CROWELL'

In [140]:
' '.join(length_4.iloc[12, 3].split()).strip()

'MARY CROWELL'

### _Valuable Sources From Today's Work_

- [`zipcodes`](https://github.com/seanpianka/Zipcodes) library
    - PyPi [Link](https://pypi.org/project/zipcodes/)
- [5 minute tutorial `uszipcode`](https://uszipcode.readthedocs.io/01-Tutorial/index.html)
- [Get unique values in a list](https://www.geeksforgeeks.org/python-get-unique-values-list/)
- [Strategy](https://daviseford.com/blog/2017/04/27/python-string-to-title-including-punctuation.html) for dealing with capitalization and strings
- [Iterating Through Rows with `pandas` iterrows()](https://cmdlinetips.com/2018/12/how-to-loop-through-pandas-rows-or-how-to-iterate-over-pandas-rows/)