# _Physician Compare National: Explore #12_

This notebook is a continuation of my analysis of the following data gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs) such as demographic information and Medicare quality program participation. This dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal(s)_

This isn't my primary focus today; I would like to continue my work on the Twitter project. However, based on the last test of the `zip_generator` function in the `explore-11` Jupyter notebook, the function looks ready for prime time. This will be its first trial on the entire Physician Compare dataset of ~2.21M rows.

In [1]:
from datetime import datetime

# current date and time
now = datetime.now()

# timestamp to signify the beginning of work
print("Work started: ", now)

Work started:  2019-10-14 11:45:55.567409


In [1]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-2.csv', low_memory=False);

# inspect the first five rows
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,24440 STONE SPRINGS BLVD,N,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,1401 JOHNSTON WILLIS DR,N,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,411 W RANDOLPH RD,N,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


# _Drop Observations Not Located in one of the 50 states_

In [2]:
# create list to store indexes
drop_index = []

# loop through state abbreviations of locations that aren't one of the 50 states
for val in ['PR', 'GU', 'VI', 'MP']:
    # gather indexes of observations with val
    indexes = list(data[data['st'] == val].index)
    drop_index.append(indexes)

In [3]:
# how long is the drop_index list?
len(drop_index)

4

In [4]:
# extract first five indexes from first list in drop_index
drop_index[0][:5]

[251, 252, 271, 682, 959]

In [5]:
# since drop_index is a list of lists, let's flatten it then sort it by the values
flat_index = [item for sublist in drop_index for item in sublist]

In [6]:
# how long is our flattened list?
len(flat_index)

7169

In [7]:
# let's take a look at the first few observations
flat_index[:10]

[251, 252, 271, 682, 959, 1882, 2278, 2279, 2468, 2717]

In [8]:
# we need to sort flat_index
flat_index = sorted(flat_index); flat_index[:10]

[251, 252, 271, 682, 959, 1882, 2278, 2279, 2468, 2717]

In [9]:
# with our list of indexes, let's drop these observations
data.drop(flat_index)['st'].value_counts()

CA    182214
NY    147143
TX    141947
FL    122414
PA    118094
MI    104868
IL     93257
OH     88951
NC     81437
MA     67807
WI     64466
MN     63019
NJ     59030
GA     56316
WA     55258
VA     53381
IN     49461
MD     48677
TN     44113
MO     42797
AZ     41899
CO     37223
KY     31497
OR     30908
CT     30019
SC     28161
LA     26278
IA     23996
AL     23056
KS     21571
OK     20610
UT     18180
MS     15990
ME     14378
AR     14080
NE     13699
ID     13094
NV     11747
WV     11691
NH     11608
NM     11418
DE      9656
RI      8133
SD      8062
MT      7739
ND      7286
HI      6986
DC      6151
AK      5139
VT      5077
WY      3639
Name: st, dtype: int64

In [10]:
# how many unique values are in the state column once we drop according to flat_index?
len(data.drop(flat_index)['st'].astype(str).unique())

51

Looks good! Let's apply this now to our `data` DataFrame, after which we can apply our zip code function!

In [11]:
# drop indexes with state abbreviations we're not focused on
data2 = data.drop(flat_index)
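
The loop, flatten, sort, and drop steps above can also be collapsed into a single boolean mask. The sketch below is only an illustration on a made-up toy DataFrame (`toy`, `non_states`, and `toy_kept` are hypothetical names, not part of this notebook); it assumes the same four abbreviations are the only ones to exclude.

```python
import pandas as pd

# toy stand-in for the real DataFrame; only the `st` column matters here
toy = pd.DataFrame({
    'cty': ['RESTON', 'SAN JUAN', 'HAGATNA', 'DULLES'],
    'st': ['VA', 'PR', 'GU', 'VA'],
})

# abbreviations to exclude (the same four used in the loop above)
non_states = ['PR', 'GU', 'VI', 'MP']

# keep only the rows whose state is NOT in the exclusion list
toy_kept = toy[~toy['st'].isin(non_states)]
print(toy_kept['st'].tolist())
```

Like `drop`, this keeps the original index labels, so later lookups by index label would behave the same way.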

## _Insert (& Clean) `full_location` column_

In [12]:
%%time
# create full_location by combining cty and st
full_location = data2['cty'] + ', ' + data2['st']

CPU times: user 369 ms, sys: 93.8 ms, total: 463 ms
Wall time: 627 ms


In [13]:
# print out the columns and their associated index
for i, v in enumerate(list(data2.columns)):
    print('Column index: {} Column name: {}'.format(i, v))

Column index: 0 Column name: npi
Column index: 1 Column name: ind_pac_id
Column index: 2 Column name: ind_enrl_id
Column index: 3 Column name: full_nm
Column index: 4 Column name: gndr
Column index: 5 Column name: cred
Column index: 6 Column name: med_sch
Column index: 7 Column name: grd_yr
Column index: 8 Column name: pri_spec
Column index: 9 Column name: sec_spec_1
Column index: 10 Column name: sec_spec_2
Column index: 11 Column name: sec_spec_3
Column index: 12 Column name: sec_spec_4
Column index: 13 Column name: sec_spec_all
Column index: 14 Column name: org_lgl_nm
Column index: 15 Column name: org_pac_id
Column index: 16 Column name: num_org_mem
Column index: 17 Column name: full_adr
Column index: 18 Column name: ln_2_sprs
Column index: 19 Column name: cty
Column index: 20 Column name: st
Column index: 21 Column name: zip
Column index: 22 Column name: phn_numbr
Column index: 23 Column name: hosp_afl_1
Column index: 24 Column name: hosp_afl_lbn_1
Column index: 25 Column name: hosp

In [14]:
# insert full location after st column (index = 21)
data2.insert(loc=21, column='full_location', value=full_location)

In [15]:
# see if insertion worked
data2.head(2)

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,full_location,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,"RESTON, VA",201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,"ARLINGTON, VA",222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


Now that we've inserted the `full_location` column, we need to clean it up a little bit. As you will see below, the combining of the columns didn't work for some of the observations because there were typos in the original columns (i.e. `cty` and `st`). We discovered this in our previous notebook and will address and update the data accordingly.

In [16]:
# example of a full_location with typo
data2.loc[286572, 'full_location']

'KITTERY, ME 03904, ME'

We're going to import the `re` library, which will help us deal with the strings in this column. Below is an example of `re`'s `findall` function, which in this case looks for all the commas in the `full_location` column for the observation at index `286572`. As the output below shows, it correctly returns the two commas in a list.

In [17]:
import re

re.findall(r',', data2.loc[286572, 'full_location'])

[',', ',']

Now that we have the tools necessary, let's loop through this whole column and see where there are other 'messy' `full_location` entries.

In [18]:
%%time
# for loop that will check for other observations that have 2 or more commas
for index, row in data2.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

Full location with at least two commas found at 111125
Full location with at least two commas found at 267029
Full location with at least two commas found at 286572
Full location with at least two commas found at 332637
Full location with at least two commas found at 415837
Full location with at least two commas found at 840123
Full location with at least two commas found at 920591
Full location with at least two commas found at 1059997
Full location with at least two commas found at 1135187
Full location with at least two commas found at 1238978
Full location with at least two commas found at 1551245
Full location with at least two commas found at 1645471
Full location with at least two commas found at 1705880
Full location with at least two commas found at 1812563
Full location with at least two commas found at 1820194
Full location with at least two commas found at 2024083
Full location with at least two commas found at 2158428
Full location with at least two commas found at 2195532

Looks like we have a few typos to address. Instead of entering the indexes manually, let's run the same loop as above with a slight tweak: we'll append these index values to a list so they're easier to look up.

In [19]:
%%time
# empty list to store indexes with 2 or more commas
commas2 = []

# for loop that will check for other observations that have 2 or more commas
for index, row in data2.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        commas2.append(index)
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

Full location with at least two commas found at 111125
Full location with at least two commas found at 267029
Full location with at least two commas found at 286572
Full location with at least two commas found at 332637
Full location with at least two commas found at 415837
Full location with at least two commas found at 840123
Full location with at least two commas found at 920591
Full location with at least two commas found at 1059997
Full location with at least two commas found at 1135187
Full location with at least two commas found at 1238978
Full location with at least two commas found at 1551245
Full location with at least two commas found at 1645471
Full location with at least two commas found at 1705880
Full location with at least two commas found at 1812563
Full location with at least two commas found at 1820194
Full location with at least two commas found at 2024083
Full location with at least two commas found at 2158428
Full location with at least two commas found at 2195532
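
The same check can be written without `iterrows` by counting commas on the whole column at once. This is a minimal sketch on a made-up three-row DataFrame (`toy`, `comma_counts`, and `commas2_sketch` are hypothetical names); I have not timed it against the loop on the full dataset.

```python
import pandas as pd

# toy rows: one clean entry and two messy ones, indexed like the real data
toy = pd.DataFrame(
    {'full_location': ['RESTON, VA', 'KITTERY, ME 03904, ME', 'NEW YORK,, NY']},
    index=[0, 286572, 332637],
)

# count the commas in every entry in one vectorized call
comma_counts = toy['full_location'].str.count(',')

# collect the index labels of entries with two or more commas
commas2_sketch = toy.index[comma_counts >= 2].tolist()
print(commas2_sketch)
```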

In [20]:
# loop through and print out full location
for idx in commas2:
    print(data2['full_location'][idx])

MONROE, LA, LA
VALLEY STREAM, NY, NY
KITTERY, ME 03904, ME
NEW YORK,, NY
WASHINGTON, DC 20037, DC
RINDGE, NH, NH
CENTENNIAL,, CO
NASHVILLE,, TN
WASHINGTON, DC, DC
RINDGE, NH, NH
RIDGEWOOD, NEW JERSEY, NJ
WEST CHESTER, OH, OH
DESOTO, TX, TX
MODESTO,, CA
APOPKA,, FL
NASHVILLE,, TN
LITTLE ROCK, AR, AR
ALEXANDRIA,, IN


In [21]:
# loop through to test potential strategy to clean up full location
for idx in commas2:
    one, two, three = data2['full_location'][idx].split(',')
    new_string = (one + ',' + three).strip()
    print(new_string)

MONROE, LA
VALLEY STREAM, NY
KITTERY, ME
NEW YORK, NY
WASHINGTON, DC
RINDGE, NH
CENTENNIAL, CO
NASHVILLE, TN
WASHINGTON, DC
RINDGE, NH
RIDGEWOOD, NJ
WEST CHESTER, OH
DESOTO, TX
MODESTO, CA
APOPKA, FL
NASHVILLE, TN
LITTLE ROCK, AR
ALEXANDRIA, IN
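
The three-way unpacking in the cell above works because every messy entry has exactly two commas. As a slightly more defensive sketch, the hypothetical helper below (`clean_location` is not part of this notebook) keeps whatever comes before the first comma and after the last one, so it would also tolerate entries with one comma or more than two.

```python
def clean_location(value):
    """Keep the text before the first comma and after the last comma."""
    parts = value.split(',')
    city = parts[0].strip()    # everything before the first comma
    state = parts[-1].strip()  # everything after the last comma
    return '{}, {}'.format(city, state)

# three of the messy entries printed above, plus one already-clean entry
samples = ['MONROE, LA, LA', 'KITTERY, ME 03904, ME', 'NEW YORK,, NY', 'RESTON, VA']
cleaned = [clean_location(s) for s in samples]
print(cleaned)
```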


In [22]:
# for loop that splits the string on commas and then writes back the string in the correct format
for idx in commas2:
    one, two, three = data2['full_location'][idx].split(',')
    new_string = (one + ',' + three)
    # assign through .loc rather than chained indexing, which triggers SettingWithCopyWarning
    data2.loc[idx, 'full_location'] = new_string


In [23]:
# loop through values and see if full_locations were updated
for idx in commas2:
    print(data2['full_location'][idx])

MONROE, LA
VALLEY STREAM, NY
KITTERY, ME
NEW YORK, NY
WASHINGTON, DC
RINDGE, NH
CENTENNIAL, CO
NASHVILLE, TN
WASHINGTON, DC
RINDGE, NH
RIDGEWOOD, NJ
WEST CHESTER, OH
DESOTO, TX
MODESTO, CA
APOPKA, FL
NASHVILLE, TN
LITTLE ROCK, AR
ALEXANDRIA, IN


In [24]:
# the issue stemmed from the value in the cty column, let's loop through those to see what they look like
for idx in commas2:
    print(data2['cty'][idx])

MONROE, LA
VALLEY STREAM, NY
KITTERY, ME 03904
NEW YORK,
WASHINGTON, DC 20037
RINDGE, NH
CENTENNIAL,
NASHVILLE,
WASHINGTON, DC
RINDGE, NH
RIDGEWOOD, NEW JERSEY
WEST CHESTER, OH
DESOTO, TX
MODESTO,
APOPKA,
NASHVILLE,
LITTLE ROCK, AR
ALEXANDRIA,


In [25]:
# if we split on the comma, it should be pretty straightforward to extract the city and state
for idx in commas2:
    cty, st = data2['full_location'][idx].split(',')
    print(cty, st)

MONROE  LA
VALLEY STREAM  NY
KITTERY  ME
NEW YORK  NY
WASHINGTON  DC
RINDGE  NH
CENTENNIAL  CO
NASHVILLE  TN
WASHINGTON  DC
RINDGE  NH
RIDGEWOOD  NJ
WEST CHESTER  OH
DESOTO  TX
MODESTO  CA
APOPKA  FL
NASHVILLE  TN
LITTLE ROCK  AR
ALEXANDRIA  IN


In [26]:
# apply the above split to our dataset
for idx in commas2:
    cty, st = data2['full_location'][idx].split(',')
    # assign through .loc rather than chained indexing, which triggers SettingWithCopyWarning
    data2.loc[idx, 'cty'] = cty
    print('Updated cty column for index {}'.format(idx))

Updated cty column for index 111125
Updated cty column for index 267029
Updated cty column for index 286572
Updated cty column for index 332637
Updated cty column for index 415837
Updated cty column for index 840123
Updated cty column for index 920591
Updated cty column for index 1059997
Updated cty column for index 1135187
Updated cty column for index 1238978
Updated cty column for index 1551245
Updated cty column for index 1645471
Updated cty column for index 1705880
Updated cty column for index 1812563
Updated cty column for index 1820194
Updated cty column for index 2024083
Updated cty column for index 2158428
Updated cty column for index 2195532


In [27]:
# loop through cty observations to make sure they've been updated
for idx in commas2:
    print(data2['cty'][idx])

MONROE
VALLEY STREAM
KITTERY
NEW YORK
WASHINGTON
RINDGE
CENTENNIAL
NASHVILLE
WASHINGTON
RINDGE
RIDGEWOOD
WEST CHESTER
DESOTO
MODESTO
APOPKA
NASHVILLE
LITTLE ROCK
ALEXANDRIA


In [28]:
# for loop that will check for other observations that have 2 or more commas (TO CONFIRM NO MORE)
for index, row in data2.iterrows():
    comma_check = re.findall(r',', row['full_location'])
    if len(comma_check) >= 2:
        print('Full location with at least two commas found at {}'.format(index))
    else:
        pass

# _Save Updates to New CSV file_

We want to make the changes permanent so we'll save the current `data2` `pandas` DataFrame to a CSV so we can just load it in next time without having to go through the above steps!

In [41]:
# save data set to new CSV file
data2.to_csv('physician_compare_national-updates-3.csv', index=False)
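
One thing worth keeping in mind when round-tripping through CSV: `pandas` infers column types on load, so a column of purely numeric zip codes can come back as integers and lose any leading zeros. The sketch below demonstrates this on a made-up two-row frame written to an in-memory buffer; whether any values in the real file are affected is something I have not verified.

```python
import io
import pandas as pd

# toy frame with zip codes stored as text, one with a leading zero
toy = pd.DataFrame({'zip': ['03904', '201903219']})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# default load: the column is inferred as integers
buffer.seek(0)
as_default = pd.read_csv(buffer)

# explicit dtype: the column stays text
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'zip': str})

print(as_default['zip'].tolist())
print(as_text['zip'].tolist())
```

Passing `dtype={'zip': str}` to `read_csv` is one way to keep the values exactly as they were written.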

# _`zip_generator` Function & Application_

In a previous notebook we developed a `zip_generator` function that goes through and assesses the `zip` code column. Getting this column right is crucial if we want to accurately graph the distribution of physicians. Let's define it below to see what it looks like.

In [29]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data2 = pd.read_csv('physician_compare_national-updates-3.csv', low_memory=False);

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
# import from uszipcode library its SearchEngine and zip code databases
from uszipcode import SearchEngine, SimpleZipcode, Zipcode

# initialize zip code search engine object
search = SearchEngine()

# function developed to generate zip codes in the format best suited for analysis
def zip_generator(row):
    if len(row['zip']) == 9: # if the zip is 9 characters long (meaning it has a +4) we'll grab just the first five
        return row['zip'][:5]
    elif len(row['zip']) == 5: # if the zip is 5 characters long, we'll leave as is
        return row['zip']
    else: # for anything else we'll look up the zip code based on the city and state
        # split observation into cty and st
        cty, st = [x.strip() for x in row['full_location'].split(', ')]
        # use the search engine to look up cty and st
        lookup = search.by_city_and_state(city=cty, state=st)
        if lookup == []:
            # no match found, so keep the original value
            return row['zip']
        else:
            # take the first value of the first match
            zipcode = lookup[0].values()[0]
            return zipcode

Now this is going to take a while; when I tested it on a 25% sample of the data, the above function took a little over an hour to process the column, so for the full data set we're unfortunately looking at a few hours. However, I want to highlight that this is a necessary step, as we want to ensure that our location data is as accurate as possible! This isn't to say that it's the most efficient method possible, but it is the most efficient one we have at the moment.

So without further ado, let's apply it and see what happens!

### _First Trial Run of `zip_generator` on entire dataset_

In [31]:
%%time
# first run of zip_generator on the entire dataset
data2['new_zipcode'] = data2.apply(zip_generator, axis=1)

CPU times: user 3h 48min 44s, sys: 4min 19s, total: 3h 53min 3s
Wall time: 3h 54min 19s
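
With a wall time of nearly four hours, it may be worth sketching a cheaper variant for future runs. The idea below is to handle the 9- and 5-character cases with vectorized string operations and to look up each distinct `full_location` only once. Everything here is an assumption for illustration: `zip_generator_fast` and `fake_city_state_lookup` are hypothetical names, the lookup is a stub standing in for the `uszipcode` search, and I have not measured any speedup on the real data.

```python
import pandas as pd

lookup_calls = []  # records every location passed to the stub lookup

def fake_city_state_lookup(full_location):
    """Hypothetical stand-in for the uszipcode search; returns None when nothing is found."""
    lookup_calls.append(full_location)
    table = {'RESTON, VA': '20190', 'KITTERY, ME': '03904'}
    return table.get(full_location)

def zip_generator_fast(df, lookup):
    """Return a Series of cleaned zip codes, looking up each distinct location once."""
    zips = df['zip'].astype(str)
    lengths = zips.str.len()
    result = zips.copy()

    # 9-character ZIP+4 values: keep only the first five characters
    result[lengths == 9] = zips[lengths == 9].str[:5]

    # anything that is neither 5 nor 9 characters long needs a lookup
    needs_lookup = ~lengths.isin([5, 9])

    # look up each distinct location once, then map the answers back onto the rows
    unique_locations = df.loc[needs_lookup, 'full_location'].unique()
    found = {loc: lookup(loc) for loc in unique_locations}
    looked_up = df.loc[needs_lookup, 'full_location'].map(found)

    # where the lookup found nothing, fall back to the original value
    result[needs_lookup] = looked_up.fillna(zips[needs_lookup])
    return result

# made-up rows covering each branch
toy = pd.DataFrame({
    'zip': ['201903219', '22205', '2019', '2019', '123'],
    'full_location': ['RESTON, VA', 'ARLINGTON, VA', 'RESTON, VA',
                      'RESTON, VA', 'NOWHERE, ZZ'],
})
toy['new_zipcode'] = zip_generator_fast(toy, fake_city_state_lookup)
print(toy['new_zipcode'].tolist())
print(len(lookup_calls))
```

In the toy run, the two rows that share a location trigger a single lookup between them, which is where any savings would come from.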


In [45]:
# what are the value counts for the associated zip code lengths?
data2['new_zipcode'].astype(str).str.len().value_counts()

5    2201882
8       1701
4         36
3          2
Name: new_zipcode, dtype: int64
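
To see which observations sit behind the non-5 counts above, the same length calculation can be reused as a filter. This is a small sketch on made-up values (`toy`, `lengths`, and `edge_cases` are hypothetical names, and the zip codes are invented for illustration).

```python
import pandas as pd

# made-up rows: one clean zip, one that is too short, one that is too long
toy = pd.DataFrame({
    'full_location': ['RESTON, VA', 'KITTERY, ME', 'DULLES, VA'],
    'new_zipcode': ['20190', '3904', '20166224'],
})

# length of every zip code, computed the same way as in the cell above
lengths = toy['new_zipcode'].astype(str).str.len()

# keep only the rows whose zip code is not exactly five characters long
edge_cases = toy[lengths != 5]
print(edge_cases)
```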

Looks like we've covered the vast majority of the observations! However, there are still a few edge cases we need to deal with. We'll address these in the next notebook, so let's be sure to save our updates to a new CSV so we don't have to wait ~4 hrs again!

In [78]:
data2.to_csv('physician_compare_national-updates-4.csv', index=False);