# _Physician Compare National: Explore #10_

This notebook is a continuation of my analysis of the following data, gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs), such as demographic information and Medicare quality program participation. This dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal(s)_

Keep working on how best to clean up the zip code column. Why am I spending so much time on this particular column? Zip codes will probably be our most important geographic indicator for this project. We'll use this value to place each observation in the correct location, which will then be used to (hopefully) generate a wide range of interactive visualizations.

In [1]:
from datetime import datetime

# current date and time
now = datetime.now()

# timestamp to signify the beginning of work
print("Work started: ", now)

Work started:  2019-10-07 09:26:23.961108


In [2]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-2.csv', low_memory=False);

# inspect the first five rows
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,24440 STONE SPRINGS BLVD,N,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,1401 JOHNSTON WILLIS DR,N,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133,411 W RANDOLPH RD,N,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


In [75]:
# what are the data types in each column?
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2210790 entries, 0 to 2210789
Data columns (total 34 columns):
npi               int64
ind_pac_id        int64
ind_enrl_id       object
full_nm           object
gndr              object
cred              object
med_sch           object
grd_yr            int64
pri_spec          object
sec_spec_1        object
sec_spec_2        object
sec_spec_3        object
sec_spec_4        object
sec_spec_all      object
org_lgl_nm        object
org_pac_id        float64
num_org_mem       int64
full_adr          object
ln_2_sprs         object
cty               object
st                object
zip               object
phn_numbr         float64
hosp_afl_1        object
hosp_afl_lbn_1    object
hosp_afl_2        object
hosp_afl_lbn_2    object
hosp_afl_3        object
hosp_afl_lbn_3    object
hosp_afl_4        object
hosp_afl_lbn_4    object
hosp_afl_5        float64
hosp_afl_lbn_5    object
assgn             object
dtypes: float64(3), int64(4), object(2

## _One Last Attempt at Trying to Look Up Zip Codes_

In [77]:
# extract the first 1000 rows
thousand = data[:1000]

In [78]:
# what is the data type of the zip column in thousand?
thousand['zip'].dtype

dtype('O')
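
An `object` dtype means pandas is holding the zip codes as Python objects (here, strings). If the file were ever re-read with `zip` parsed as a number, any leading zeros would be dropped. Below is a minimal sketch of forcing the column to be read as text with the `dtype` argument of `pd.read_csv`. It uses a small made-up CSV rather than the real file, so the rows are assumptions for illustration only.

```python
import io
import pandas as pd

# made-up two-row CSV standing in for the real file (an assumption for illustration)
csv_text = "cty,st,zip\nKENDALL PARK,NJ,08824\nRESTON,VA,201903219\n"

# default parsing infers an integer column, so the leading zero is lost
as_number = pd.read_csv(io.StringIO(csv_text))

# dtype={'zip': str} keeps every zip code exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})

print(as_number['zip'].tolist())
print(as_text['zip'].tolist())
```

Whether this helps here depends on where the zeros were lost; if they were already missing in the downloaded file, reading as text only prevents further damage.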

In [79]:
# what are the lengths of the first five zip codes in the zip column?
for x in thousand['zip'].head():
    print(len(str(x)))

9
9
9
9
9


In [82]:
# what are the unique lengths of the zip codes in the zip column
sorted(thousand['zip'].astype(str).str.len().unique())

[4, 5, 7, 8, 9]

In [83]:
# what are the value counts for the associated zip code lengths?
thousand['zip'].astype(str).str.len().value_counts()

9    918
8     66
5      9
7      5
4      2
Name: zip, dtype: int64

### _Create `full_location` Column_

We'll combine values of `cty` and `st` to create a new column that has the full location.

In [91]:
# what are the data types for each column?
print(thousand['cty'].dtype)
print(thousand['st'].dtype)

object
object


In [84]:
# check out the first five cty
thousand['cty'].head()

0                RESTON
1             ARLINGTON
2                DULLES
3    NORTH CHESTERFIELD
4              HOPEWELL
Name: cty, dtype: object

In [85]:
# what are the first five states?
thousand['st'].head()

0    VA
1    VA
2    VA
3    VA
4    VA
Name: st, dtype: object

In [90]:
# how can we combine them?
print(thousand['cty'].head() + thousand['st'].head())
print('-' * 30)
print(thousand['cty'].head() + ' ' + thousand['st'].head())
print('-' * 30)
print(thousand['cty'].head() + ', ' + thousand['st'].head())
print('-' * 30)

0                RESTONVA
1             ARLINGTONVA
2                DULLESVA
3    NORTH CHESTERFIELDVA
4              HOPEWELLVA
dtype: object
------------------------------
0                RESTON VA
1             ARLINGTON VA
2                DULLES VA
3    NORTH CHESTERFIELD VA
4              HOPEWELL VA
dtype: object
------------------------------
0                RESTON, VA
1             ARLINGTON, VA
2                DULLES, VA
3    NORTH CHESTERFIELD, VA
4              HOPEWELL, VA
dtype: object
------------------------------


In [99]:
# create full_location by combining cty and st
full_location = thousand['cty'] + ', ' + thousand['st']

In [96]:
# print out the columns and their associated index
for i, v in enumerate(list(thousand.columns)):
    print('Column index: {} Column name: {}'.format(i, v))

Column index: 0 Column name: npi
Column index: 1 Column name: ind_pac_id
Column index: 2 Column name: ind_enrl_id
Column index: 3 Column name: full_nm
Column index: 4 Column name: gndr
Column index: 5 Column name: cred
Column index: 6 Column name: med_sch
Column index: 7 Column name: grd_yr
Column index: 8 Column name: pri_spec
Column index: 9 Column name: sec_spec_1
Column index: 10 Column name: sec_spec_2
Column index: 11 Column name: sec_spec_3
Column index: 12 Column name: sec_spec_4
Column index: 13 Column name: sec_spec_all
Column index: 14 Column name: org_lgl_nm
Column index: 15 Column name: org_pac_id
Column index: 16 Column name: num_org_mem
Column index: 17 Column name: full_adr
Column index: 18 Column name: ln_2_sprs
Column index: 19 Column name: cty
Column index: 20 Column name: st
Column index: 21 Column name: zip
Column index: 22 Column name: phn_numbr
Column index: 23 Column name: hosp_afl_1
Column index: 24 Column name: hosp_afl_lbn_1
Column index: 25 Column name: hosp

In [101]:
# insert full location after st column (index = 21)
thousand.insert(loc=21, column='full_location', value=full_location)

In [102]:
# see if insertion worked
thousand.head(2)

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,full_adr,ln_2_sprs,cty,st,full_location,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1850 TOWN CTR PKWY,N,RESTON,VA,"RESTON, VA",201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182,1701 N GEORGE MASON DR,N,ARLINGTON,VA,"ARLINGTON, VA",222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


## _Extract info from `full_location`, search by city and state_

In [107]:
# how do we extract information from full_location
# method 1: split on the comma, then extract
for loc in thousand['full_location'].head().str.split(','):
    print(loc[0].strip(), loc[1].strip())

RESTON VA
ARLINGTON VA
DULLES VA
NORTH CHESTERFIELD VA
HOPEWELL VA
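
The split above assumes every `full_location` value is a string with exactly one comma. Here is a hedged sketch of a slightly more defensive version. `split_location` is a hypothetical helper, not part of pandas or `uszipcode`; it splits on the last comma only and returns `(None, None)` for missing or malformed values.

```python
def split_location(full_location):
    """Hypothetical helper: split 'CITY, ST' into (city, state).

    Splits on the last comma only, and returns (None, None) for values
    that are not strings or contain no comma.
    """
    if not isinstance(full_location, str) or ',' not in full_location:
        return None, None
    city, state = full_location.rsplit(',', 1)
    return city.strip(), state.strip()


print(split_location('NORTH CHESTERFIELD, VA'))
print(split_location(float('nan')))
```

Returning a pair of `None` values keeps tuple unpacking (`cty, st = ...`) from failing on a missing value such as `NaN`.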


In [110]:
# import library that can potentially help us query USA zip codes
import uszipcode

In [112]:
# initialize US Zip code search engine from uszipcode library
search = uszipcode.SearchEngine()

In [122]:
# split first observation from full_location
cty, st = [x.strip() for x in thousand['full_location'][0].split(',')]
print(cty, st)
#search.by_city_and_state(thousand['zip'][0])

RESTON VA


In [148]:
# input cty and st into uszipcode search_by_city_and_state
(search.by_city_and_state(city=cty, state=st)[0]).values()[0]

'20190'

In [151]:
import time
import random

# generate a loop that goes through first 25 rows and extracts zip code
for city in thousand['full_location'].head(25):
    # initialize new US zip code search engine
    search = uszipcode.SearchEngine()
    # split observation into cty and st
    cty, st = [x.strip() for x in city.split(', ')]
    # input cty, st to search for zip code
    zipcode = (search.by_city_and_state(city=cty, state=st)[0]).values()[0]
    time.sleep(random.uniform(0,1))
    print(cty, st, zipcode)

RESTON VA 20190
ARLINGTON VA 22201
DULLES VA 20189
NORTH CHESTERFIELD VA 23831
HOPEWELL VA 23860
GLENVIEW IL 60025
GLENVIEW IL 60025
HIGHLAND PARK IL 60035
HIGHALND PARK IL 60035
SKOKIE IL 60076
SKOKIE IL 60076
SKOKIE IL 60076
EVANSTON IL 60201
EVANSTON IL 60201
TOLEDO OH 43601
TOLEDO OH 43601
OREGON OH 43616
TOLEDO OH 43601
TOLEDO OH 43601
TOLEDO OH 43601
MENTOR OH 44060
AURORA CO 80010
AURORA CO 80010
ORANGE CITY FL 32763
ORANGE CITY FL 32763


In [160]:
# lets put the above into a new function
def zipcode_lookup(df, empty_list):
    # loop over the rows and extract the zip code for each
    for city in df['full_location']:
        # initialize new US zip code search engine
        search = uszipcode.SearchEngine()
        # split observation into cty and st
        cty, st = [x.strip() for x in city.split(', ')]
        # input cty, st to search for zip code
        zipcode = (search.by_city_and_state(city=cty, state=st)[0]).values()[0]
        # append zipcode to empty_list
        empty_list.append(zipcode)
        time.sleep(random.uniform(0,1))
        print(cty, st, zipcode)
        
    return empty_list

In [161]:
zip_list = []
# let's test the above function, this time on every row in thousand
updated_zip_list = zipcode_lookup(thousand, zip_list)

RESTON VA 20190
ARLINGTON VA 22201
DULLES VA 20189
NORTH CHESTERFIELD VA 23831
HOPEWELL VA 23860
GLENVIEW IL 60025
GLENVIEW IL 60025
HIGHLAND PARK IL 60035
HIGHALND PARK IL 60035
SKOKIE IL 60076
SKOKIE IL 60076
SKOKIE IL 60076
EVANSTON IL 60201
EVANSTON IL 60201
TOLEDO OH 43601
TOLEDO OH 43601
OREGON OH 43616
TOLEDO OH 43601
TOLEDO OH 43601
TOLEDO OH 43601
MENTOR OH 44060
AURORA CO 80010
AURORA CO 80010
ORANGE CITY FL 32763
ORANGE CITY FL 32763
ORANGE CITY FL 32763
QUAKERTOWN PA 18951
QUAKERTOWN PA 18951
TULSA OK 74103
LOS ANGELES CA 90001
LOS ANGELES CA 90001
MOUNT VERNON OH 43050
MOUNT VERNON OH 43050
HOUSTON TX 77002


IndexError: list index out of range

In [200]:
thousand['zip'][100:150].astype(str).str.len().value_counts()

9    40
8     8
5     1
4     1
Name: zip, dtype: int64

In [195]:
for index, row in thousand[100:150].iterrows():
    if len(row['zip']) == 9: 
        pass
    elif len(row['zip']) == 5:
        pass
    else:
        print(index, 'Not 9 or 5', '\nRow zip = {}'.format(row['zip']))

102 Not 9 or 5 
Row zip = 8824
108 Not 9 or 5 
Row zip = 21842508
113 Not 9 or 5 
Row zip = 30872008
114 Not 9 or 5 
Row zip = 33015300
115 Not 9 or 5 
Row zip = 38015409
116 Not 9 or 5 
Row zip = 38202403
117 Not 9 or 5 
Row zip = 38852585
136 Not 9 or 5 
Row zip = 77483052
139 Not 9 or 5 
Row zip = 79606136


In [206]:
# create an empty list for test below
test = []

In [207]:
# count the zip codes whose length isn't 9 or 5 and collect their indexes in a list
count = 0
for index, row in thousand[100:150].iterrows():
    if len(row['zip']) == 9: 
        pass
    elif len(row['zip']) == 5:
        pass
    else:
        count += 1
        # append index to test list
        test.append(index)
        # print(index, 'Not 9 or 5', '\nRow zip = {}'.format(row['zip']))

In [208]:
count, test

(9, [102, 108, 113, 114, 115, 116, 117, 136, 139])

In [210]:
whole_test = []

In [211]:
# same as above, but for every row in thousand: count the zip codes whose length isn't 9 or 5 and collect their indexes
count = 0
for index, row in thousand.iterrows():
    if len(row['zip']) == 9: 
        pass
    elif len(row['zip']) == 5:
        pass
    else:
        count += 1
        # append index to test list
        whole_test.append(index)

In [213]:
count, whole_test[:5]

(73, [46, 47, 64, 65, 69])

In [214]:
thousand['zip'].astype(str).str.len().value_counts()

9    918
8     66
5      9
7      5
4      2
Name: zip, dtype: int64

In [215]:
66 + 5 + 2

73
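
The same count and index list can probably be obtained without `iterrows()`, by building a boolean mask from the string lengths. This is a sketch on a small made-up Series standing in for `thousand['zip']`; the sample values are assumptions for illustration.

```python
import pandas as pd

# small made-up sample standing in for thousand['zip']
zips = pd.Series(['201903219', '8824', '21842508', '60025', '1234567'])

# length of each zip code as a string
lengths = zips.astype(str).str.len()

# True wherever the length is neither 5 nor 9
bad_mask = ~lengths.isin([5, 9])

bad_count = int(bad_mask.sum())
bad_index = zips[bad_mask].index.tolist()
print(bad_count, bad_index)
```

On the real data, `bad_index` would play the role of `whole_test`, and `bad_count` should agree with the `66 + 5 + 2` check above.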

In [217]:
for index, val in thousand[100:150].iterrows():
    if index in whole_test:
        print('Index is in whole_test list.')
    else:
        pass

Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.
Index is in whole_test list.


In [238]:
for index, row in thousand[100:150].iterrows():
    if index in whole_test:
        print('Index is in whole_test list.')
        # split observation into cty and st
        cty, st = [x.strip() for x in row['full_location'].split(', ')]
        # input cty, st to search for zip code
        zipcode = (search.by_city_and_state(city=cty, state=st)[0]).values()[0]
        print(cty, st, zipcode)
        # assign new zipcode to zip column
        thousand['zip'][index] = zipcode
        # print(thousand['zip'].at(int(index)))
        print('Input new zip code for observation')
        time.sleep(random.uniform(0,1))
    else:
        pass

Index is in whole_test list.
KENDALL PARK NJ 08824


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


Input new zip code for observation
Index is in whole_test list.
BRAINTREE MA 02184
Input new zip code for observation
Index is in whole_test list.
WINDHAM NH 03087
Input new zip code for observation
Index is in whole_test list.
CONCORD NH 03301
Input new zip code for observation
Index is in whole_test list.
PORTSMOUTH NH 03801
Input new zip code for observation
Index is in whole_test list.
DOVER NH 03820
Input new zip code for observation
Index is in whole_test list.
STRATHAM NH 03885
Input new zip code for observation
Index is in whole_test list.
MIDDLETOWN NJ 07748
Input new zip code for observation
Index is in whole_test list.
MORRISTOWN NJ 07960
Input new zip code for observation
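
The `SettingWithCopyWarning` in the output above is pandas flagging chained assignment (`thousand['zip'][index] = ...`) on a frame that was itself sliced from `data`. Below is a sketch of the pattern the linked pandas documentation describes instead: take an explicit `.copy()` of the slice, then assign in one step with `.loc`. The two-row frame is made up for illustration.

```python
import pandas as pd

# made-up frame standing in for the full data set
data = pd.DataFrame({
    'cty': ['RESTON', 'KENDALL PARK'],
    'st': ['VA', 'NJ'],
    'zip': ['201903219', '8824'],
})

# take an explicit copy so the subset owns its data
subset = data[:2].copy()

# single-step, label-based assignment instead of subset['zip'][1] = ...
subset.loc[1, 'zip'] = '08824'

print(subset['zip'].tolist())
print(data['zip'].tolist())
```

With the copy in place, the update lands in `subset` and the original `data` is left untouched, which removes the ambiguity the warning is about.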


Now let's check the value counts for the rows in `100:150`; there should no longer be any zip codes whose length is not either 9 or 5.

In [239]:
thousand['zip'][100:150].astype(str).str.len().value_counts()

9    40
5    10
Name: zip, dtype: int64
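
Comparing the flagged values with what the lookup returned hints at where the odd lengths may come from: `8824` became `08824`, and `21842508` shares its leading digits with `02184`. That is consistent with leading zeros having been dropped at some earlier step, although this is an assumption rather than something the data confirms. Under that assumption, a zip code could be repaired without any lookup by left-padding it with zeros. `normalize_zip` below is a hypothetical helper written for this sketch.

```python
def normalize_zip(raw):
    """Hypothetical helper: return a 5-digit zip code, or None if unusable.

    Assumes short values lost leading zeros: up to 5 digits are padded to
    5, and 6 to 9 digits are padded to 9 (ZIP+4) before taking the prefix.
    """
    digits = str(raw).strip()
    if not digits.isdigit() or len(digits) > 9:
        return None
    width = 5 if len(digits) <= 5 else 9
    return digits.zfill(width)[:5]


sample = ['201903219', '8824', '21842508', '30872008']
fixed = [normalize_zip(z) for z in sample]
print(fixed)
```

On the DataFrame, this could be applied with something like `thousand['zip'].map(normalize_zip)`. Unlike the city and state lookup, it keeps the zip code that was actually recorded for the address.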

In [240]:
thousand['zip'].astype(str).str.len().value_counts()

9    918
8     58
5     18
7      5
4      1
Name: zip, dtype: int64

In [241]:
# recount the zip codes in thousand whose length isn't 9 or 5 and rebuild the list of their indexes
whole_test = []
count = 0
for index, row in thousand.iterrows():
    if len(row['zip']) == 9: 
        pass
    elif len(row['zip']) == 5:
        pass
    else:
        count += 1
        # append index to test list
        whole_test.append(index)

In [243]:
count

64

In [244]:
58 + 5 + 1

64

We're closer to confirming that we have something that will update the `zip` column the way we want! Let's turn the example from a few cells above into a function so that we can tinker with it if necessary.

In [254]:
def zipcode_looker(df, index_list):
    count = 0
    for index, row in df.iterrows():
        if index in index_list:
            print('Found index that needs updating at index {}.'.format(index))
            # split observation into cty and st
            cty, st = [x.strip() for x in row['full_location'].split(', ')]
            # input cty, st to search for zip code
            zipcode = (search.by_city_and_state(city=cty, state=st)[0]).values()[0]
            # print(cty, st, zipcode)
            # assign new zipcode to zip column
            df['zip'][index] = zipcode
            # print(thousand['zip'].at(int(index)))
            print('Input new zip code for observation at index {}'.format(index))
            count += 1
            print('Rows edited = {}'.format(count))
            time.sleep(random.uniform(0,0.5))
            
    else:
        pass
    
    return df

In [255]:
# create a copy of thousand to be able to compare before/after value counts
before = thousand.copy()

In [256]:
# check value counts of before
before['zip'].astype(str).str.len().value_counts()

9    918
8     58
5     18
7      5
4      1
Name: zip, dtype: int64

In [257]:
# pass before into zipcode function: initial test
after = zipcode_looker(before, whole_test);

Found index that needs updating at index 46.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':


Input new zip code for observation at index 46
Rows edited = 1
Found index that needs updating at index 47.
Input new zip code for observation at index 47
Rows edited = 2
Found index that needs updating at index 64.
Input new zip code for observation at index 64
Rows edited = 3
Found index that needs updating at index 65.
Input new zip code for observation at index 65
Rows edited = 4
Found index that needs updating at index 69.
Input new zip code for observation at index 69
Rows edited = 5
Found index that needs updating at index 77.
Input new zip code for observation at index 77
Rows edited = 6
Found index that needs updating at index 78.
Input new zip code for observation at index 78
Rows edited = 7
Found index that needs updating at index 82.
Input new zip code for observation at index 82
Rows edited = 8
Found index that needs updating at index 93.
Input new zip code for observation at index 93
Rows edited = 9
Found index that needs updating at index 94.
Input new zip code for obser

IndexError: list index out of range

While our function didn't technically work (again...), we have a clue this time, thanks to the function printing the index of each observation it was about to change. At index `889`, the zip code needed to be updated (its length was neither 5 nor 9); however, computing the `zipcode` variable raised `IndexError: list index out of range`. Up to this point, I hadn't considered that the list in question might be the object returned by the `uszipcode` search!

Let's look closer at the observation at index `889`, and see what the search returns for that `cty`, `st` combination.

In [259]:
# see what the observation looks like
before.iloc[889]

npi                                   1003021866
ind_pac_id                            4082844022
ind_enrl_id                      I20140313000428
full_nm                            DONNA  MELLE 
gndr                                           F
cred                                  Not Listed
med_sch                                    OTHER
grd_yr                                      1995
pri_spec           CERTIFIED NURSE MIDWIFE (CNM)
sec_spec_1                                  None
sec_spec_2                                  None
sec_spec_3                                  None
sec_spec_4                                  None
sec_spec_all                                None
org_lgl_nm        ATLANTICARE PHYSICIAN GROUP PA
org_pac_id                           8.52795e+09
num_org_mem                                  382
full_adr                   65 W JIMMIE LEEDS RD 
ln_2_sprs                                      N
cty                                       POMONA
st                  

In [264]:
# search by POMONA, NJ
search.by_city_and_state(city="POMONA", state="NJ") # .values()[0]

[]

I think we may have found the culprit! When we search for `POMONA, NJ`, we get an empty list. Because our function takes the value at index `0` of the result, an empty list raises an error: there are no values in it. We can update our function to skip such cases and store the indexes we weren't able to look up, and thus couldn't update.

In [284]:
def zipcode_looker(df, index_list):
    # counters to keep track of number of indexes updated or not updated
    updated_count = 0
    noupdate_count = 0
    # list to store indexes that weren't updated
    noupdate_list = []
    for index, row in df.iterrows():
        if index in index_list:
            print('Found index that needs updating at index {}.'.format(index))
            # split observation into cty and st
            cty, st = [x.strip() for x in row['full_location'].split(', ')]
            # use searchengine to look up cty and state
            lookup = search.by_city_and_state(city=cty, state=st)
            # check if lookup is an empty list
            if lookup == []:
                print('Could not update the observation at index {}'.format(index))
                noupdate_count += 1
                print('Rows edited: {}\nRows Not Updated Due to Error: {}'.format(updated_count, noupdate_count))
                noupdate_list.append(index)
                print('-' * 30)
                pass
            else:
                # input cty, st to search for zip code
                zipcode = lookup[0].values()[0]
                # assign new zipcode to zip column
                df['zip'][index] = zipcode
                # print(thousand['zip'].at(int(index)))
                print('Input new zip code for observation at index {}'.format(index))
                updated_count += 1
                print('Rows edited: {}\nRows Not Updated Due to Error: {}'.format(updated_count, noupdate_count))
                print('-' * 30)
                time.sleep(random.uniform(0,0.5))
            
    else:
        pass
    
    return df, noupdate_list

In [285]:
zipcode = []
zipcode == []

True

In [286]:
# check value counts of before
before['zip'].astype(str).str.len().value_counts()

9    918
5     81
8      1
Name: zip, dtype: int64

In [287]:
# pass before into zipcode function: second test
after, noupdate_list = zipcode_looker(before, whole_test);

Found index that needs updating at index 46.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Input new zip code for observation at index 46
Rows edited: 1
Rows Not Updated Due to Error: 0
------------------------------
Found index that needs updating at index 47.
Input new zip code for observation at index 47
Rows edited: 2
Rows Not Updated Due to Error: 0
------------------------------
Found index that needs updating at index 64.
Input new zip code for observation at index 64
Rows edited: 3
Rows Not Updated Due to Error: 0
------------------------------
Found index that needs updating at index 65.
Input new zip code for observation at index 65
Rows edited: 4
Rows Not Updated Due to Error: 0
------------------------------
Found index that needs updating at index 69.
Input new zip code for observation at index 69
Rows edited: 5
Rows Not Updated Due to Error: 0
------------------------------
Found index that needs updating at index 77.
Input new zip code for observation at index 77
Rows edited: 6
Rows Not Updated Due to Error: 0
------------------------------
Found index that ne

In [288]:
# check value counts of after
after['zip'].astype(str).str.len().value_counts()

9    918
5     81
8      1
Name: zip, dtype: int64

In [290]:
# what is the index that could not be updated?
noupdate_list

[889]

In [293]:
# what does observation at index 889 look like?
after.iloc[889]

npi                                   1003021866
ind_pac_id                            4082844022
ind_enrl_id                      I20140313000428
full_nm                            DONNA  MELLE 
gndr                                           F
cred                                  Not Listed
med_sch                                    OTHER
grd_yr                                      1995
pri_spec           CERTIFIED NURSE MIDWIFE (CNM)
sec_spec_1                                  None
sec_spec_2                                  None
sec_spec_3                                  None
sec_spec_4                                  None
sec_spec_all                                None
org_lgl_nm        ATLANTICARE PHYSICIAN GROUP PA
org_pac_id                           8.52795e+09
num_org_mem                                  382
full_adr                   65 W JIMMIE LEEDS RD 
ln_2_sprs                                      N
cty                                       POMONA
st                  

In [294]:
# let's see what returns when we input cty and st
search.by_city_and_state(city="POMONA", state="NJ")

[]

### _Update: `Pomona, New Jersey`_

It looks like this particular location is an "unincorporated community and census-designated place (CDP) located within Galloway Township" ([Wikipedia](https://en.wikipedia.org/wiki/Pomona,_New_Jersey)). As I understand it, this is sort of a town within a town. Now comes the hard part: do we just retrieve the corresponding zip code, or do we use the zip code of the associated township?

Before we rush into an answer, let's update our `zipcode_looker` function to make it more functional: eliminate the printouts and restructure it so that we can use something like `apply()` to generate a new column.

In [308]:
def zip_generator(row):
    # split observation into cty and st
    cty, st = [x.strip() for x in row['full_location'].split(', ')]
    # use searchengine to look up cty and state
    lookup = search.by_city_and_state(city=cty, state=st)
    if lookup == []:
        return row['zip']
    else:
        zipcode = lookup[0].values()[0]
        time.sleep(random.uniform(0, 0.25))
        return zipcode

In [309]:
# test to see if the above function works with apply
thousand.head(5).apply(lambda row: zip_generator(row), axis=1)

0    20190
1    22201
2    20189
3    23831
4    23860
dtype: object

In [326]:
# create a copy of data set that we can test on
tester = thousand.copy()

In [327]:
# let's test to see if we can generate a new column using the function we just created with apply
# tester.apply(lambda row: zip_generator(row), axis=1)
# it worked but was taking WAY too long...
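
One possible way to cut the cost, sketched here under assumptions: many rows share the same city and state (`GLENVIEW, IL` appears back to back above), so each pair only needs to be looked up once. The example memoizes a lookup with `functools.lru_cache`. `FAKE_RESULTS` and `cached_lookup` are hypothetical stand-ins; in the notebook, the function body would call `search.by_city_and_state` instead of reading from a dictionary.

```python
from functools import lru_cache

# hypothetical stand-in for the real search; maps (city, state) -> zip code
FAKE_RESULTS = {('RESTON', 'VA'): '20190', ('GLENVIEW', 'IL'): '60025'}
calls = []


@lru_cache(maxsize=None)
def cached_lookup(cty, st):
    """Return a zip code for (cty, st), or None when nothing is found."""
    calls.append((cty, st))  # record each real (uncached) lookup
    return FAKE_RESULTS.get((cty, st))


locations = [('GLENVIEW', 'IL'), ('GLENVIEW', 'IL'),
             ('RESTON', 'VA'), ('POMONA', 'NJ')]
zips = [cached_lookup(cty, st) for cty, st in locations]

print(zips)
print(len(calls))
```

Four rows trigger only three real lookups here; on data with many repeated locations the saving should be larger.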

In [338]:
# update on function above
def zip_generator(row):
    if len(row['zip']) == 9:
        # ZIP+4: keep only the first five digits
        return row['zip'][:5]
    elif len(row['zip']) == 5:
        return row['zip']
    else:
        # split observation into cty and st
        cty, st = [x.strip() for x in row['full_location'].split(', ')]
        # use searchengine to look up cty and state
        lookup = search.by_city_and_state(city=cty, state=st)
        if lookup == []:
            return row['zip']
        else:
            zipcode = lookup[0].values()[0]
            #time.sleep(random.uniform(0, 0.25))
            return zipcode

In [339]:
# test to see if the above function works with apply
tester.head(5).apply(lambda row: zip_generator(row), axis=1)

0    20190
1    22205
2    20166
3    23235
4    23860
dtype: object

In [340]:
%%time 
# let's test it now on the first 100 rows and time it
# test to see if the above function works with apply
tester.head(100).apply(lambda row: zip_generator(row), axis=1)

CPU times: user 1.11 s, sys: 35.9 ms, total: 1.14 s
Wall time: 1.35 s


0     20190
1     22205
2     20166
3     23235
4     23860
5     60026
6     60026
7     60035
8     60035
9     60076
10    60076
11    60076
12    60201
13    60201
14    43608
15    43608
16    43616
17    43623
18    43623
19    43623
20    44060
21    80045
22    80045
23    32763
24    32763
25    32763
26    18951
27    18951
28    74104
29    90095
      ...  
70    74104
71    48059
72    30094
73    48532
74    77656
75    94538
76    34669
77    07731
78    08827
79    44087
80    70433
81    91402
82    02188
83    97213
84    63090
85    77566
86    14618
87    14642
88    37777
89    37801
90    95370
91    15202
92    15212
93    07410
94    07834
95    29615
96    85711
97    80601
98    15905
99    49337
Length: 100, dtype: object

In [341]:
%%time

# ok let's test it on the whole data set and create a new column
tester['new_zip'] = tester.apply(lambda row: zip_generator(row), axis=1)

CPU times: user 6.52 s, sys: 140 ms, total: 6.66 s
Wall time: 6.69 s


In [342]:
# check the value counts of the zip code lengths in the new_zip column
tester['new_zip'].astype(str).str.len().value_counts()

5    999
8      1
Name: new_zip, dtype: int64

In [343]:
((2200000 / 1000) * 6.69) / 60 /24

10.220833333333333
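
The expression above divides seconds by 60 (giving minutes) and then by 24, so the `10.22` it returns is not in a meaningful unit. Here is a sketch of the same back-of-the-envelope estimate with the units spelled out. It assumes run time scales linearly with the number of rows, which may not hold, since only rows with unusual zip lengths trigger a lookup.

```python
rows_total = 2_200_000      # approximate size of the full data set
rows_timed = 1_000          # rows in the timed run
seconds_per_batch = 6.69    # wall time reported above

# scale the timed run up to the full data set
total_seconds = rows_total / rows_timed * seconds_per_batch

# seconds -> minutes -> hours
total_hours = total_seconds / 60 / 60

print(round(total_seconds), round(total_hours, 1))
```

Under that assumption, the full data set would take roughly four hours rather than ten days.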