# _Physician Compare National: Explore #10_

This notebook continues my analysis of the following data gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs) such as demographic information and Medicare quality program participation. This dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal(s)_

Further test the functions designed yesterday to clean the zip code column; if a viable solution emerges, I will apply it to the full data set.

In [1]:
from datetime import datetime

# current date and time
now = datetime.now()

# timestamp to signify the beginning of work
print("Work started: ", now)

Work started:  2019-10-08 09:07:05.612815


In [2]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-2.csv', low_memory=False);

# inspect the first five rows
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,24440 STONE SPRINGS BLVD,N,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,1401 JOHNSTON WILLIS DR,N,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,411 W RANDOLPH RD,N,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


In [3]:
# what are the unique lengths of the zip codes?
print('Length of zip codes take the following values: {}'.format(sorted(data['zip'].str.len().unique())))

Length of zip codes take the following values: [3, 4, 5, 7, 8, 9]


In [4]:
# what are the value counts for each length of the zip codes?
data['zip'].str.len().value_counts()

9    2059140
8     132917
5      13060
7       3039
3       1710
4        924
Name: zip, dtype: int64

## _Reimplement Zipcode Function from `explore-9`_

In [5]:
# sample 25% of original data
sample = data.sample(frac=0.25, random_state=1)

In [6]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 552698 entries, 1471535 to 1131096
Data columns (total 34 columns):
npi               552698 non-null int64
ind_pac_id        552698 non-null int64
ind_enrl_id       552698 non-null object
full_nm           552698 non-null object
gndr              552698 non-null object
cred              552698 non-null object
med_sch           552698 non-null object
grd_yr            552698 non-null int64
pri_spec          552698 non-null object
sec_spec_1        552698 non-null object
sec_spec_2        552698 non-null object
sec_spec_3        552698 non-null object
sec_spec_4        552698 non-null object
sec_spec_all      552698 non-null object
org_lgl_nm        552698 non-null object
org_pac_id        508216 non-null float64
num_org_mem       552698 non-null int64
full_adr          552698 non-null object
ln_2_sprs         552698 non-null object
cty               552698 non-null object
st                552698 non-null object
zip               552698

In [8]:
# what is the data type of the zip column in sample?
sample['zip'].dtype

dtype('O')

In [9]:
# what are the lengths of the first five zip codes in the zip column?
for x in sample['zip'].head():
    print(len(str(x)))

9
9
9
9
9


In [10]:
# what are the unique lengths of the zip codes in the zip column
sorted(sample['zip'].astype(str).str.len().unique())

[3, 4, 5, 7, 8, 9]

In [11]:
# what are the value counts for the associated zip code lengths?
sample['zip'].astype(str).str.len().value_counts()

9    514907
8     33060
5      3272
7       760
3       443
4       256
Name: zip, dtype: int64

## _Create `full_location` column in `sample`_

In [12]:
%%time
# create full_location by combining cty and st
full_location = sample['cty'] + ', ' + sample['st']

CPU times: user 101 ms, sys: 34.1 ms, total: 135 ms
Wall time: 207 ms


In [13]:
# print out the columns and their associated index
for i, v in enumerate(list(sample.columns)):
    print('Column index: {} Column name: {}'.format(i, v))

Column index: 0 Column name: npi
Column index: 1 Column name: ind_pac_id
Column index: 2 Column name: ind_enrl_id
Column index: 3 Column name: full_nm
Column index: 4 Column name: gndr
Column index: 5 Column name: cred
Column index: 6 Column name: med_sch
Column index: 7 Column name: grd_yr
Column index: 8 Column name: pri_spec
Column index: 9 Column name: sec_spec_1
Column index: 10 Column name: sec_spec_2
Column index: 11 Column name: sec_spec_3
Column index: 12 Column name: sec_spec_4
Column index: 13 Column name: sec_spec_all
Column index: 14 Column name: org_lgl_nm
Column index: 15 Column name: org_pac_id
Column index: 16 Column name: num_org_mem
Column index: 17 Column name: full_adr
Column index: 18 Column name: ln_2_sprs
Column index: 19 Column name: cty
Column index: 20 Column name: st
Column index: 21 Column name: zip
Column index: 22 Column name: phn_numbr
Column index: 23 Column name: hosp_afl_1
Column index: 24 Column name: hosp_afl_lbn_1
Column index: 25 Column name: hosp

In [14]:
# insert full location after st column (index = 21)
sample.insert(loc=21, column='full_location', value=full_location)

In [15]:
# see if insertion worked
sample.head(2)

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,full_location,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
1471535,1669574299,6103910880,I20070926000346,JYOTHI JOLEPALEM,M,Not Listed,OTHER,1988,PULMONARY DISEASE,CRITICAL CARE (INTENSIVISTS),,,,CRITICAL CARE (INTENSIVISTS),,,1,3 ERIE CT SUITE 7010,N,OAK PARK,IL,"OAK PARK, IL",603022519,,140049.0,WEST SUBURBAN MEDICAL CENTER,140082.0,LOUIS A WEISS MEMORIAL HOSPITAL,140251.0,COMMUNITY FIRST MEDICAL CENTER,140240.0,WESTLAKE COMMUNITY HOSPITAL,,,M
965233,1437367109,3779656590,I20080725000695,LISA A GOMBITA,F,Not Listed,OTHER,2006,NURSE PRACTITIONER,,,,,,ONCOLOGY HEMATOLOGY ASSOCIATION INC,9830003000.0,110,1600 CORAOPOLIS HEIGHTS RD F,N,CORAOPOLIS,PA,"CORAOPOLIS, PA",151084307,4123292000.0,390114.0,MAGEE WOMENS HOSPITAL OF UPMC HEALTH SYSTEM,390037.0,HERITAGE VALLEY SEWICKLEY,390036.0,HERITAGE VALLEY BEAVER,390157.0,OHIO VALLEY GENERAL HOSPITAL,390107.0,UPMC PASSAVANT,Y


In [20]:
# import libraries
from uszipcode import SearchEngine, SimpleZipcode, Zipcode

# initialize zip code search engine object
search = SearchEngine()

# function developed to generate zip codes in the format best suited for analysis
def zip_generator(row):
    if len(row['zip']) == 9:
        # nine-character value: keep the first five digits
        return row['zip'][:5]
    elif len(row['zip']) == 5:
        # already a five-digit zip code
        return row['zip']
    else:
        # split observation into cty and st
        cty, st = [x.strip() for x in row['full_location'].split(', ')]
        # use the search engine to look up cty and st
        lookup = search.by_city_and_state(city=cty, state=st)
        if not lookup:
            # no match found: fall back to the original value
            return row['zip']
        # take the zip code of the first match
        return lookup[0].zipcode

### _Test Function_

In [17]:
# create a copy of data set that we can test on
tester = sample.copy()

In [18]:
# test to see if the above function works with apply
tester.head(5).apply(lambda row: zip_generator(row), axis=1)

1471535    60302
965233     15108
582461     59405
1070284    43205
1153666    33133
dtype: object

In [21]:
%%time 
# let's test it now on the first 100 rows and time it
# test to see if the above function works with apply
tester.head(100).apply(lambda row: zip_generator(row), axis=1);

CPU times: user 526 ms, sys: 46.1 ms, total: 572 ms
Wall time: 826 ms


1471535    60302
965233     15108
582461     59405
1070284    43205
1153666    33133
543951     43351
323086     27910
2200522    18301
1939920    40517
1793826    94063
1406228    74501
1510661    01803
2156628    55455
1275864    86023
797013     85209
1513891    12208
1669106    00757
2079948    04102
1680837    29902
2073754    46219
1927017    92108
113232     60026
781935     34287
1545566    29072
278760     68769
1588833    84094
750783     67002
2120505    72903
1215491    85032
1288042    75226
           ...  
913707     75390
988340     60637
1463853    64064
732622     98026
569075     27705
1327785    37411
2072525    89406
1439192    19810
1953633    67950
1602335    94063
1282539    32803
2009064    70726
189574     38501
1951912    08077
1567970    10801
506324     37918
889850     07601
351808     66160
78525      48109
1577154    15701
374263     89074
341096     30236
1252488    98201
5084       38671
906734     33486
408499     32504
819552     44195
2105003    300

In [22]:
# what is the shape of our tester dataframe?
tester.shape

(552698, 35)

In [29]:
%%time
# test out by creating a new column on whole tester dataset
tester['new_zipcode'] = tester.apply(lambda row: zip_generator(row), axis=1);

KeyError: ('VI', 'occurred at index 385599')

Uh-oh, we ran into an error! It looks like a value of `VI` resulted in a `KeyError`. Now I remember that we have observations located in places like Puerto Rico and the U.S. Virgin Islands; we'll have to remove these observations for now.
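As an alternative to dropping rows, the lookup itself could be guarded so that an unrecognized state falls through to the "no match" branch. The sketch below is only an illustration: `safe_lookup` and `fake_lookup` are hypothetical names, and `fake_lookup` is a stand-in that mimics the `KeyError` seen above rather than the real search engine.

```python
def safe_lookup(lookup, city, state):
    """Call a city/state lookup, returning [] when the state is unknown.

    `lookup` is any callable accepting city= and state= keyword arguments,
    for example the by_city_and_state method used in this notebook.
    """
    try:
        return lookup(city=city, state=state)
    except KeyError:
        # unrecognized state abbreviation such as 'VI'
        return []


# hypothetical stand-in for the real search engine (illustration only)
def fake_lookup(city, state):
    known = {('RESTON', 'VA'): ['20190']}
    if state != 'VA':
        raise KeyError(state)
    return known.get((city, state), [])


print(safe_lookup(fake_lookup, 'RESTON', 'VA'))
print(safe_lookup(fake_lookup, 'CHRISTIANSTED', 'VI'))
```

With a guard like this, territory rows would simply keep their original `zip` value instead of stopping the whole `apply`.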

In [32]:
# how many states are there in the sample?
len(sample['st'].astype(str).unique())

55

There are `55` unique values in our state column. We'll need to investigate further to determine which ones we need to drop.

In [31]:
# what are the value counts for each state abbreviation?
sample['st'].astype(str).value_counts()

CA    45762
NY    36855
TX    35635
FL    30347
PA    29343
MI    26057
IL    23238
OH    22069
NC    20454
MA    16954
WI    16034
MN    15826
NJ    14738
GA    14101
WA    13783
VA    13389
IN    12443
MD    12154
MO    10931
TN    10903
AZ    10578
CO     9283
KY     7922
OR     7840
CT     7566
SC     7007
LA     6514
IA     6028
AL     5783
KS     5517
OK     5115
UT     4457
MS     4020
ME     3531
AR     3491
NE     3473
ID     3265
NV     2905
WV     2893
NH     2845
NM     2781
DE     2384
SD     2035
RI     1992
MT     1980
ND     1828
HI     1795
PR     1683
DC     1539
AK     1258
VT     1257
WY      957
GU       88
VI       52
MP       20
Name: st, dtype: int64

There are a few state abbreviations that seem different: `PR`, `GU`, `VI`, `MP` and `DC`.

- `PR` : Puerto Rico
- `GU` : Guam
- `VI` : U.S. Virgin Islands
- `MP` : Northern Mariana Islands
- `DC` : District of Columbia

Since we're focused primarily on the 50 states, we'll have to drop the observations which have the abbreviations mentioned above. But before we do that, let's take a look at `DC` to see if the `uszipcode` library can look up locations in the District of Columbia.

In [33]:
# initialize zip code search engine object
search = SearchEngine()

In [34]:
# are we able to look up observations located in Washington, DC?
search.by_city_and_state(city='Washington', state='DC')

[SimpleZipcode(zipcode='20001', zipcode_type='Standard', major_city='Washington', post_office_city='Washington, DC', common_city_list=['Washington'], county='District of Columbia', state='DC', lat=38.91, lng=-77.02, timezone='Eastern', radius_in_miles=2.0, area_code_list=['202'], population=38551, population_density=17689.0, land_area_in_sqmi=2.18, water_area_in_sqmi=0.06, housing_units=18751, occupied_housing_units=16500, median_home_value=495300, median_household_income=78848, bounds_west=-77.028292, bounds_east=-77.007177, bounds_north=38.929279, bounds_south=38.89071),
 SimpleZipcode(zipcode='20002', zipcode_type='Standard', major_city='Washington', post_office_city='Washington, DC', common_city_list=['Washington'], county='District of Columbia', state='DC', lat=38.91, lng=-76.98, timezone='Eastern', radius_in_miles=2.0, area_code_list=['202'], population=52370, population_density=9961.0, land_area_in_sqmi=5.26, water_area_in_sqmi=0.22, housing_units=26166, occupied_housing_units=2

Looks like `uszipcode` includes Washington, DC in its lookup! This means we'll technically have 51 unique values in the `st` column.

So let's drop the observations with `PR`, `GU`, `VI` and `MP` so we can retry our function.

In [51]:
drop_index = []

for val in ['PR', 'GU', 'VI', 'MP']:
    # gather indexes of observations with val
    indexes = list(tester[tester['st'] == val].index)
    drop_index.append(indexes)

In [52]:
len(drop_index)

4

In [55]:
# extract first five indexes from first list in drop_index
drop_index[0][:5]

[1669106, 989051, 2127358, 553874, 160340]

In [56]:
# since drop_index is a list of lists, let's flatten it then sort it by the values
flat_index = [item for sublist in drop_index for item in sublist]

**Resource**: [How to flatten a list](https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists)

In [58]:
# how long is our flattened list?
len(flat_index)

1843

In [59]:
# let's take a look at the first few observations
flat_index[:10]

[1669106,
 989051,
 2127358,
 553874,
 160340,
 331296,
 1355872,
 215634,
 584436,
 1859962]

In [60]:
# we need to sort flat_index
flat_index = sorted(flat_index); flat_index[:10]

[2468, 3106, 4698, 8053, 11178, 11378, 11907, 12400, 14959, 16904]

In [62]:
# with our list of indexes, let's drop these observations
tester.drop(flat_index)['st'].value_counts()

CA    45762
NY    36855
TX    35635
FL    30347
PA    29343
MI    26057
IL    23238
OH    22069
NC    20454
MA    16954
WI    16034
MN    15826
NJ    14738
GA    14101
WA    13783
VA    13389
IN    12443
MD    12154
MO    10931
TN    10903
AZ    10578
CO     9283
KY     7922
OR     7840
CT     7566
SC     7007
LA     6514
IA     6028
AL     5783
KS     5517
OK     5115
UT     4457
MS     4020
ME     3531
AR     3491
NE     3473
ID     3265
NV     2905
WV     2893
NH     2845
NM     2781
DE     2384
SD     2035
RI     1992
MT     1980
ND     1828
HI     1795
DC     1539
AK     1258
VT     1257
WY      957
Name: st, dtype: int64

In [64]:
# how many unique values are in the state column once we drop according to flat_index?
len(tester.drop(flat_index)['st'].astype(str).unique())

51

Looks good! Let's apply this now to our `tester` DataFrame, after which we can apply our zipcode function!

In [65]:
# drop indexes with state abbreviations we're not focused on
tester = tester.drop(flat_index)
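The gather, flatten, sort, and drop steps above can also be sketched as a single boolean filter with `Series.isin`. The `toy` frame below is made-up data standing in for `tester`; only the `st` column name comes from the notebook.

```python
import pandas as pd

# toy frame standing in for `tester`; values are illustrative only
toy = pd.DataFrame(
    {'st': ['VA', 'PR', 'DC', 'GU', 'VI', 'MP', 'CA']},
    index=[10, 11, 12, 13, 14, 15, 16],
)

territories = ['PR', 'GU', 'VI', 'MP']

# keep only the rows whose state is NOT in the territory list
kept = toy[~toy['st'].isin(territories)]

print(kept['st'].tolist())
```

This avoids building the intermediate list of lists, though the explicit index list is handy when you want to inspect exactly which rows are about to be removed.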

In [67]:
%%time
# second test of zip_generator function
tester['new_zipcode'] = tester.apply(lambda row: zip_generator(row), axis=1);

ValueError: ('too many values to unpack (expected 2)', 'occurred at index 286572')

In [70]:
# look up observation at index 286572 
tester.loc[286572]

npi                                                      1124580857
ind_pac_id                                               4688908700
ind_enrl_id                                         I20190624002728
full_nm                                              ELINOR M KURZ 
gndr                                                              F
cred                                                     Not Listed
med_sch                                                       OTHER
grd_yr                                                         2018
pri_spec                                         NURSE PRACTITIONER
sec_spec_1                                                     None
sec_spec_2                                                     None
sec_spec_3                                                     None
sec_spec_4                                                     None
sec_spec_all                                                   None
org_lgl_nm                                      

In [69]:
# grab full_location from above observation
tester.loc[286572, 'full_location']

'KITTERY, ME 03904, ME'

Looks like we have a typo for this observation! The value of `cty` appears to have been entered with all of the location information (city, state, and zip code). Let's explore how we can address this specific observation, and see if there are any other observations that were entered similarly.

In [76]:
tester.loc[286572, 'full_location'].find(',')

7

In [77]:
tester.loc[286572, 'cty'].find(',')

7

In [74]:
if tester.loc[286572, 'full_location'].find(',') > 0:
    print('True')

True


In [80]:
for city in tester['cty'].head():
    if city.find(',') > 0:
        print('Comma in cty value')
    else:
        print('No comma found in cty value')

No comma found in cty value
No comma found in cty value
No comma found in cty value
No comma found in cty value
No comma found in cty value


In [85]:
import re

re.findall(r',', tester.loc[286572, 'full_location'])

[',', ',']

In [101]:
%%time
# for loop that will check for other observations that have 2 or more commas
for index, row in tester.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

Full location with at least two commas found at 332637
Full location with at least two commas found at 415837
Full location with at least two commas found at 1551245
Full location with at least two commas found at 286572
Full location with at least two commas found at 2195532
CPU times: user 50.8 s, sys: 856 ms, total: 51.6 s
Wall time: 54.5 s


Our `for` loop above found five observations with at least two commas. However, one thing we should take into account is that it took nearly a minute to iterate over all the rows. While that is not especially long, this particular strategy may run into issues if we try it on the whole data set, which is roughly four times larger.

Let's go ahead and explore these five observations to see what is going on.

In [102]:
%%time
# empty list to store indexes with two or more commas
commas2 = []

# for loop that will check for other observations that have 2 or more commas
for index, row in tester.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        commas2.append(index)
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

Full location with at least two commas found at 332637
Full location with at least two commas found at 415837
Full location with at least two commas found at 1551245
Full location with at least two commas found at 286572
Full location with at least two commas found at 2195532
CPU times: user 52.2 s, sys: 668 ms, total: 52.9 s
Wall time: 57 s


In [103]:
# check out the list we just created
commas2

[332637, 415837, 1551245, 286572, 2195532]
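Since the `iterrows` scan took close to a minute, here is a minimal sketch of the same check using the vectorized `Series.str.count`. The `toy` frame is illustrative (two of its rows reuse values printed in this notebook), and `commas2_vec` is a hypothetical name for the resulting list.

```python
import pandas as pd

# toy frame standing in for `tester`; index labels are illustrative
toy = pd.DataFrame(
    {'full_location': ['RESTON, VA',
                       'NEW YORK,, NY',
                       'KITTERY, ME 03904, ME',
                       'DULLES, VA']},
    index=[5, 332637, 286572, 9],
)

# count the commas in every value at once, then keep rows with two or more
mask = toy['full_location'].str.count(',') >= 2
commas2_vec = toy.index[mask].tolist()

print(commas2_vec)
```

The mask can also be used directly, e.g. `toy[mask]`, to view the offending rows without a second loop.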

In [107]:
# loop through and print out full location
for idx in commas2:
    print(tester['full_location'][idx])

NEW YORK,, NY
WASHINGTON, DC 20037, DC
RIDGEWOOD, NEW JERSEY, NJ
KITTERY, ME 03904, ME
ALEXANDRIA,, IN


In [121]:
# for loop that splits string on commas and then returns string in correct format
for idx in commas2:
    one, two, three = tester['full_location'][idx].split(',')
    new_string = (one + ',' + three)
    tester['full_location'][idx] = new_string

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


In [122]:
# loop through values and see if full_locations were updated
for idx in commas2:
    print(tester['full_location'][idx])

NEW YORK, NY
WASHINGTON, DC
RIDGEWOOD, NJ
KITTERY, ME
ALEXANDRIA, IN
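The update worked, but the chained assignment (`tester['full_location'][idx] = ...`) is what triggered the warning about setting a value on a copy of a slice. A minimal sketch of the same repair using `.loc`, which assigns to the DataFrame directly, is shown below on a made-up `toy` frame; it also updates `cty` in the same pass.

```python
import pandas as pd

# toy frame standing in for `tester`, using two of the values printed above
toy = pd.DataFrame(
    {'cty': ['NEW YORK,', 'KITTERY, ME 03904'],
     'st': ['NY', 'ME'],
     'full_location': ['NEW YORK,, NY', 'KITTERY, ME 03904, ME']},
    index=[332637, 286572],
)

for idx in toy.index:
    parts = toy.loc[idx, 'full_location'].split(',')
    # first piece is the city, last piece is the state
    city, state = parts[0].strip(), parts[-1].strip()
    # .loc[row, column] writes into the frame itself, not a temporary copy
    toy.loc[idx, 'full_location'] = city + ', ' + state
    toy.loc[idx, 'cty'] = city

print(toy[['cty', 'full_location']])
```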


In [124]:
# the issue stemmed from the value in the cty column, let's loop through those to see what they look like
for idx in commas2:
    print(tester['cty'][idx])

NEW YORK,
WASHINGTON, DC 20037
RIDGEWOOD, NEW JERSEY
KITTERY, ME 03904
ALEXANDRIA,


In [128]:
for idx in commas2:
    cty, st = tester['full_location'][idx].split(',')
    tester['cty'][idx] = cty
    print('Updated cty column for index {}'.format(idx))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Updated cty column for index 332637
Updated cty column for index 415837
Updated cty column for index 1551245
Updated cty column for index 286572
Updated cty column for index 2195532


In [130]:
# loop through cty observations to make sure they've been updated
for idx in commas2:
    print(tester['cty'][idx])

NEW YORK
WASHINGTON
RIDGEWOOD
KITTERY
ALEXANDRIA


In [132]:
# for loop that will check for other observations that have 2 or more commas (TO CONFIRM NO MORE)
for index, row in tester.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

### _Second Trial Run of `zip_generator`_

In [133]:
%%time
# second test of zip_generator function
tester['new_zipcode'] = tester.apply(lambda row: zip_generator(row), axis=1);

CPU times: user 1h 2min 41s, sys: 1min 23s, total: 1h 4min 5s
Wall time: 1h 5min 45s
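Over an hour is a long time for a quarter of the data. One possible way to cut this down, sketched below on made-up values, is to handle the 9- and 5-character zip codes with vectorized string slicing and reserve the row-wise lookup for the remaining lengths (about 6% of the sample, going by the length counts earlier). The names `toy`, `new_zip`, and `needs_lookup` are hypothetical, and this is an untimed sketch rather than a measured improvement.

```python
import pandas as pd

# toy frame standing in for `tester`; zip values are illustrative strings
toy = pd.DataFrame({'zip': ['201903219', '22205', '6030225', '0180']})

lengths = toy['zip'].str.len()

# where() keeps the original value unless the length is 9,
# in which case it substitutes the first five characters
new_zip = toy['zip'].where(lengths != 9, toy['zip'].str[:5])

# rows that are neither 5 nor 9 characters still need the city/state lookup
needs_lookup = ~lengths.isin([5, 9])

print(new_zip.tolist())
print(needs_lookup.tolist())

# in the notebook, only this subset would then go through apply, e.g.:
# tester.loc[needs_lookup, 'new_zipcode'] = (
#     tester[needs_lookup].apply(zip_generator, axis=1))
```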


In [136]:
# check to see what the new lengths are in the new zipcode column
tester['new_zipcode'].str.len().value_counts()

5    550424
8       423
4         7
3         1
Name: new_zipcode, dtype: int64

For the most part, it looks like it worked. There appear to be some potential edge cases that we didn't account for though (hence the lengths of `8`, `4`, and `3`). However, I am going to call it a night so this will be something to look into tomorrow! 
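As a starting point for that follow-up, here is a small sketch for pulling out the leftover rows. Given how `zip_generator` is written, a plausible guess is that these are rows where the city/state lookup returned nothing and the original `zip` was passed through unchanged; the check below tests that guess. The `toy` frame is made-up data, and `leftover` and `unchanged` are hypothetical names.

```python
import pandas as pd

# toy frame standing in for `tester` after the second trial run
toy = pd.DataFrame({
    'zip': ['201903219', '12345678', '0180', '300'],
    'full_location': ['RESTON, VA', 'SOMEWHERE, TX', 'ANYTOWN, MA', 'ELSEWHERE, GA'],
    'new_zipcode': ['20190', '12345678', '0180', '300'],
})

# rows whose cleaned zip code is not five characters long
leftover = toy[toy['new_zipcode'].str.len() != 5]

# True if every leftover row still holds its original zip value
unchanged = bool((leftover['zip'] == leftover['new_zipcode']).all())

print(leftover[['zip', 'full_location', 'new_zipcode']])
print(unchanged)
```

If `unchanged` comes back `True` on the real data, the next step would be to look at which `full_location` values the search engine failed to match.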