# _Physician Compare National_

This notebook continues my analysis of the following data gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs), such as demographic information and Medicare quality program participation. The dataset is updated twice a month with the most current demographic information available at that time.

In [1]:
# import pandas
import pandas as pd
pd.options.display.max_columns = None

In [3]:
# import CSV file with data
data = pd.read_csv('physician_compare_national.csv')

In [4]:
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


Looks like we need to correct the order of the columns, like we did in the previous notebook. After we update the column order, we'll overwrite the `physician_compare_national.csv` file to reflect this change so we won't have to do it again going forward.

In [7]:
# create list of columns in correct order 
correct_order = ['npi', 'ind_pac_id', 'ind_enrl_id', 'lst_nm', 'frst_nm', 'mid_nm', 'suff', 
                'gndr', 'cred', 'med_sch', 'grd_yr', 'pri_spec', 'sec_spec_1', 'sec_spec_2', 
                'sec_spec_3', 'sec_spec_4', 'sec_spec_all', 'org_lgl_nm', 'org_pac_id', 
                'num_org_mem', 'adr_ln_1', 'adr_ln_2', 'ln_2_sprs', 'cty', 'st', 'zip', 
                'phn_numbr', 'hosp_afl_1', 'hosp_afl_lbn_1', 'hosp_afl_2', 'hosp_afl_lbn_2', 
                'hosp_afl_3', 'hosp_afl_lbn_3', 'hosp_afl_4', 'hosp_afl_lbn_4', 'hosp_afl_5', 
                'hosp_afl_lbn_5', 'assgn']

In [8]:
# reorder columns to original format
data = data[correct_order]

In [9]:
# overwrite original CSV file to reflect correct order (only needs to be run the first time through)
# data.to_csv('physician_compare_national.csv', index=False)
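
As a side note on the overwrite step, here is a minimal sketch of why `index=False` matters when writing the file back out. The toy frame and the in-memory buffers are assumptions for illustration only (they stand in for the real CSV); the point is that the default `index=True` stores the row index as an extra unnamed column that reappears on the next read.

```python
import io

import pandas as pd

# Toy stand-in for the real data (hypothetical, two columns only).
toy = pd.DataFrame({"npi": [1003000126, 1003000134], "st": ["VA", "IL"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False writes only the real columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))     # ['Unnamed: 0', 'npi', 'st']
print(list(reread_without_index.columns))  # ['npi', 'st']
```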

# _Diving Back into the Data!_

Yesterday, we started by exploring the breakdown of physicians by state. However, I came to realize that this might be somewhat misleading, as there appeared to be duplicate entries in the data (i.e., the same doctor was entered multiple times). One of the variables, though - `ind_enrl_id`, or the _Professional Enrollment ID_ - is a "unique ID for the individual professional enrollment that is the source for the data in the observation". This appears to be a better variable than the name for identifying unique physicians. Let's explore whether that is in fact true.

In [10]:
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


In [11]:
# get overview of the columns and dtypes within each one
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2210790 entries, 0 to 2210789
Data columns (total 38 columns):
npi               int64
ind_pac_id        int64
ind_enrl_id       object
lst_nm            object
frst_nm           object
mid_nm            object
suff              object
gndr              object
cred              object
med_sch           object
grd_yr            float64
pri_spec          object
sec_spec_1        object
sec_spec_2        object
sec_spec_3        object
sec_spec_4        object
sec_spec_all      object
org_lgl_nm        object
org_pac_id        float64
num_org_mem       float64
adr_ln_1          object
adr_ln_2          object
ln_2_sprs         object
cty               object
st                object
zip               object
phn_numbr         float64
hosp_afl_1        object
hosp_afl_lbn_1    object
hosp_afl_2        object
hosp_afl_lbn_2    object
hosp_afl_3        object
hosp_afl_lbn_3    object
hosp_afl_4        object
hosp_afl_lbn_4    object
hosp_afl_5

In [12]:
# see the percentage of observations that are NaNs in each column
round((data.isna().mean() * 100), 3)

npi                0.000
ind_pac_id         0.000
ind_enrl_id        0.000
lst_nm             0.002
frst_nm            0.001
mid_nm            24.582
suff              98.164
gndr               0.000
cred              68.976
med_sch            0.000
grd_yr             0.235
pri_spec           0.000
sec_spec_1        85.576
sec_spec_2        98.406
sec_spec_3        99.833
sec_spec_4        99.973
sec_spec_all      85.576
org_lgl_nm         8.051
org_pac_id         8.051
num_org_mem        8.051
adr_ln_1           0.000
adr_ln_2          65.159
ln_2_sprs         94.270
cty                0.000
st                 0.000
zip                0.000
phn_numbr         15.288
hosp_afl_1        26.610
hosp_afl_lbn_1    26.715
hosp_afl_2        59.499
hosp_afl_lbn_2    59.635
hosp_afl_3        76.644
hosp_afl_lbn_3    76.739
hosp_afl_4        85.746
hosp_afl_lbn_4    85.799
hosp_afl_5        90.769
hosp_afl_lbn_5    90.807
assgn              0.000
dtype: float64

Ok, so we can see that quite a few columns have a significant share of `NaN`s, which is something we'll have to address as we move forward. However, `ind_enrl_id` has no missing values, which is a good first step. After exploring `ind_enrl_id`, the next thing we'll look at is whether to combine the first, middle, and last names into one column, or whether that is even necessary.
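
To make "a significant share of `NaN`s" a bit more concrete, here is a rough sketch that sorts the missing percentages and lists the columns above a cut-off. The toy frame and the 50% threshold are assumptions for illustration, not values chosen anywhere in this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the real data.
toy = pd.DataFrame({
    "npi": [1, 2, 3, 4],
    "mid_nm": ["L", np.nan, "J", "M"],
    "suff": [np.nan, np.nan, np.nan, "JR"],
})

# Percentage of missing values per column, largest first.
pct_missing = toy.isna().mean().mul(100).round(3).sort_values(ascending=False)

# Hypothetical cut-off for "mostly missing" columns.
threshold = 50.0
mostly_missing = pct_missing[pct_missing > threshold].index.tolist()

print(pct_missing)
print(mostly_missing)
```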

In [13]:
# see how many unique professional enrollment ID numbers there are
len(data['ind_enrl_id'].unique())

1147121

Ok, so there appear to be a little over 1.1 million unique professional enrollment IDs in the data set, which is far fewer than the 2,210,790 total observations.
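
The same comparison can be sketched on a toy column (the IDs below are made up). `nunique()` gives the count directly, and `value_counts()` shows how many rows each ID contributes. One difference worth knowing: `nunique()` skips `NaN` by default while `len(...unique())` counts it, which does not matter here since `ind_enrl_id` has no missing values.

```python
import pandas as pd

# Hypothetical enrollment IDs, repeated the way the real rows repeat.
toy = pd.DataFrame({"ind_enrl_id": ["I1", "I1", "I2", "I2", "I2", "I3"]})

n_rows = len(toy)                                # total observations
n_ids = toy["ind_enrl_id"].nunique()             # distinct IDs (NaN excluded)
rows_per_id = toy["ind_enrl_id"].value_counts()  # rows contributed by each ID

print(n_rows, n_ids)
print(rows_per_id)
```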

In [14]:
# combine first and last name
data['frst_nm'].str.cat(data['lst_nm'], sep=" ")

0            ARDALAN ENKESHAFI
1            ARDALAN ENKESHAFI
2            ARDALAN ENKESHAFI
3            ARDALAN ENKESHAFI
4            ARDALAN ENKESHAFI
5                THOMAS CIBULL
6                THOMAS CIBULL
7                THOMAS CIBULL
8                THOMAS CIBULL
9                THOMAS CIBULL
10               THOMAS CIBULL
11               THOMAS CIBULL
12               THOMAS CIBULL
13               THOMAS CIBULL
14               RASHID KHALIL
15               RASHID KHALIL
16               RASHID KHALIL
17               RASHID KHALIL
18               RASHID KHALIL
19               RASHID KHALIL
20            JENNIFER VELOTTA
21             KEVIN ROTHCHILD
22             KEVIN ROTHCHILD
23           FREDERICK WEIGAND
24           FREDERICK WEIGAND
25           FREDERICK WEIGAND
26             AMANDA SEMONCHE
27             AMANDA SEMONCHE
28                     DAE KIM
29            PEYMAN BENHARASH
                  ...         
2210760             LISA DRIES
2210761 

As we can see above, combining first and last name makes the duplicate entries evident in multiple cases. Now let's see how many unique names there are.

In [15]:
# combine first and last name and see how many unique names there are
len(data['frst_nm'].str.cat(data['lst_nm'], sep=" ").unique())

933045

Interesting...there are `933045` unique names but `1147121` unique professional enrollment IDs. That's a difference of 214,076, so there are over 200k more IDs than names! What is going on here?
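
One possible contributor (an assumption, not something verified in this notebook) is that different professionals simply share a first and last name. A minimal sketch of how that could be checked, using made-up names and NPIs:

```python
import pandas as pd

# Entirely hypothetical names and NPIs, for illustration only.
toy = pd.DataFrame({
    "npi": [1, 1, 2, 3],
    "frst_nm": ["ALEX", "ALEX", "ALEX", "JORDAN"],
    "lst_nm": ["SMITH", "SMITH", "SMITH", "LEE"],
})

# Same first + last name combination used in the notebook.
toy["name"] = toy["frst_nm"].str.cat(toy["lst_nm"], sep=" ")

# Count how many distinct NPIs sit behind each name.
npis_per_name = toy.groupby("name")["npi"].nunique()
shared_names = npis_per_name[npis_per_name > 1]

print(shared_names)
```

Any name that shows up in `shared_names` belongs to more than one NPI, so it would collapse several people into one "unique name".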

In [16]:
# let's revisit how many missing values there are in the name-related columns
data[['lst_nm', 'frst_nm', 'mid_nm', 'suff']].isna().sum()

lst_nm          37
frst_nm         24
mid_nm      543463
suff       2170208
dtype: int64

In [17]:
# let's replace the NaNs with blank "" to make them a little easier to work with
data[['lst_nm', 'frst_nm']] = data[['lst_nm', 'frst_nm']].fillna(value="")

In [18]:
# let's dive into the missing values for first and last name
data[data['lst_nm'] == ""]

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
136176,1063454130,8628071131,I20060822000324,,BENJAMIN,HANBOK,,M,OD,OTHER,2001.0,OPTOMETRY,,,,,,,,,115 W 25TH AVE,,,SAN MATEO,CA,944032259,6503496000.0,,,,,,,,,,,Y
136177,1063454130,8628071131,I20060822000324,,BENJAMIN,HANBOK,,M,OD,OTHER,2001.0,OPTOMETRY,,,,,,,,,3191 CROW CANYON PL,SUITE C,,SAN RAMON,CA,945831349,9252441000.0,,,,,,,,,,,Y
177939,1083084875,4284937251,I20160125002851,,TRAVIS,J,,M,,OTHER,2015.0,NURSE PRACTITIONER,,,,,,NORTHWEST MEDICAL CENTER ASSOCIATION INC,5496642000.0,5.0,1607 E US HWY 136,MOSAIC FAMILY CARE ALBANY EAST,,ALBANY,MO,644028223,6607263000.0,261328.0,NORTHWEST MEDICAL CENTER,260006.0,MOSAIC LIFE CARE AT ST JOSEPH,,,,,,,Y
243886,1114025665,941246326,I20050707000585,,TERRY,,,M,OD,INDIANA UNIVERSITY - SCHOOL OF OPTOMETRY,1983.0,OPTOMETRY,,,,,,VISIONQUEST EYECARE PC,7517951000.0,6.0,1160 N STATE RD 135,,,GREENWOOD,IN,461421019,3178657000.0,,,,,,,,,,,Y
243887,1114025665,941246326,I20050707000585,,TERRY,,,M,OD,INDIANA UNIVERSITY - SCHOOL OF OPTOMETRY,1983.0,OPTOMETRY,,,,,,VISIONQUEST EYECARE PC,7517951000.0,6.0,9650 E WASHINGTON ST,,,INDIANAPOLIS,IN,462293032,3178906000.0,,,,,,,,,,,Y
298158,1134289879,941201511,I20070124000749,,STEVE,C,,M,MD,OTHER,2003.0,FAMILY MEDICINE,,,,,,SOUTHERN CALIFORNIA PERMANENTE MEDICAL GROUP,6002729000.0,8118.0,30400 CAMINO CAPISTRANO,,,SAN JUAN CAPISTRANO,CA,926751300,9492342000.0,,,,,,,,,,,Y
298159,1134289879,941201511,I20070124000749,,STEVE,C,,M,MD,OTHER,2003.0,FAMILY MEDICINE,,,,,,SOUTHERN CALIFORNIA PERMANENTE MEDICAL GROUP,6002729000.0,8118.0,5475 E LA PALMA AVE,,,ANAHEIM,CA,928072075,7142796000.0,,,,,,,,,,,Y
352205,1154847614,3375802556,I20180116000499,,TANYA,M,,F,,OTHER,2017.0,NURSE PRACTITIONER,,,,,,FAYETTE PHYSICIAN NETWORK INC,3375866000.0,99.0,25 MAIN ST,,,SMITHFIELD,PA,154788943,7245691000.0,,,,,,,,,,,Y
445731,1205119435,840458907,I20120221000078,,CHRISTINE,M,,F,,OTHER,2011.0,PHYSICAL THERAPY,,,,,,SUTTER VALLEY MEDICAL FOUNDATION,9830095000.0,1492.0,9280 W STOCKTON BLVD,SUITE 116,,ELK GROVE,CA,957588073,9166833000.0,,,,,,,,,,,Y
445732,1205119435,840458907,I20120221000078,,CHRISTINE,M,,F,,OTHER,2011.0,PHYSICAL THERAPY,,,,,,SUTTER VALLEY MEDICAL FOUNDATION,9830095000.0,1492.0,2800 L ST,,Y,SACRAMENTO,CA,958165616,,,,,,,,,,,,Y


In [19]:
data[data['frst_nm'] == ""]

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
80900,1033381983,4183808488,I20110412000132,MA,,,,F,,OTHER,1997.0,OPHTHALMOLOGY,,,,,,,,,9 MAC LEAN DR,,,GLEN HEAD,NY,115453137,,330193.0,FLUSHING HOSPITAL MEDICAL CENTER,,,,,,,,,Y
80901,1033381983,4183808488,I20110412000132,MA,,,,F,,OTHER,1997.0,OPHTHALMOLOGY,,,,,,EVERGREEN MEDICAL PLLC,1355539000.0,8.0,755 -759,,,BROOKLYN,NY,112204211,7186809000.0,330193.0,FLUSHING HOSPITAL MEDICAL CENTER,,,,,,,,,Y
80902,1033381983,4183808488,I20110412000132,MA,,,,F,,OTHER,1997.0,OPHTHALMOLOGY,,,,,,"CONEY ISLAND MEDICAL PRACTICE PLAN, P.C.",5496945000.0,284.0,2601 OCEAN PKWY,,,BROOKLYN,NY,112357745,7186165000.0,330193.0,FLUSHING HOSPITAL MEDICAL CENTER,,,,,,,,,Y
128852,1053726695,42532210,I20190130002955,JEN,,,,F,,LAKE ERIE COLLEGE OF OSTEOPATHIC MEDICINE,2014.0,FAMILY MEDICINE,,,,,,HIGH DESERT MEDICAL CORPORATION,6103731000.0,70.0,43839 N 15TH ST W,,,LANCASTER,CA,935344756,6619456000.0,,,,,,,,,,,Y
500235,1225268824,4385868678,I20140605002253,SHEN,,,,F,,UNIVERSITY OF MARYLAND SCHOOL OF MEDICINE,2009.0,ENDOCRINOLOGY,,,,,,UC REGENTS,1355249000.0,841.0,2020 SANTA MONICA BLVD,,,SANTA MONICA,CA,904042023,3104582000.0,50262.0,RONALD REAGAN U C L A MEDICAL CENTER,50549.0,LOS ROBLES HOSPITAL & MEDICAL CENTER,,,,,,,Y
500236,1225268824,4385868678,I20140605002253,SHEN,,,,F,,UNIVERSITY OF MARYLAND SCHOOL OF MEDICINE,2009.0,ENDOCRINOLOGY,,,,,,UC REGENTS,1355249000.0,841.0,2020 SANTA MONICA BLVD,SUITE 240,,SANTA MONICA,CA,904042023,3105826000.0,50262.0,RONALD REAGAN U C L A MEDICAL CENTER,50549.0,LOS ROBLES HOSPITAL & MEDICAL CENTER,,,,,,,Y
500237,1225268824,4385868678,I20140605002253,SHEN,,,,F,,UNIVERSITY OF MARYLAND SCHOOL OF MEDICINE,2009.0,ENDOCRINOLOGY,,,,,,UC REGENTS,1355249000.0,841.0,6633 TELEPHONE RD,,Y,VENTURA,CA,930035569,8056428000.0,50262.0,RONALD REAGAN U C L A MEDICAL CENTER,50549.0,LOS ROBLES HOSPITAL & MEDICAL CENTER,,,,,,,Y
885207,1407012263,8527215920,I20120820000693,LEE,,Y,,F,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2007.0,ANESTHESIOLOGY,,,,,,PHYSICIANS ANESTHESIA SERVICE PLLC,6901792000.0,151.0,1229 MADISON ST,SUITE 1440,,SEATTLE,WA,981043538,2066251000.0,500025.0,SWEDISH MEDICAL CENTER / CHERRY HILL,500027.0,SWEDISH MEDICAL CENTER - FIRST HILL/BALLARD,,,,,,,Y
1138928,1518288307,4082929229,I20150817001415,LIU,,,,F,,OTHER,1994.0,HEMATOLOGY/ONCOLOGY,INTERNAL MEDICINE,,,,INTERNAL MEDICINE,ST MARYS REGIONAL MEDICAL CENTER,42107120.0,186.0,93 CAMPUS AVE,,,LEWISTON,ME,42406030,2077774000.0,200034.0,ST MARYS REGIONAL MEDICAL CENTER,,,,,,,,,Y
1141501,1518399344,6002138021,I20141202002187,VANG,,,,F,,OTHER,2011.0,FAMILY MEDICINE,,,,,,BAYLOR ST LUKES MEDICAL GROUP,9133213000.0,256.0,210 LAKE RD,,Y,LAKE JACKSON,TX,775664982,71379820000.0,450072.0,CHI ST. LUKES' BRAZOSPORT HOSPITAL,450018.0,UNIVERSITY OF TEXAS MEDICAL BRANCH GALVESTON,,,,,,,Y


Because so few observations are missing `lst_nm` or `frst_nm` - 37 and 24, respectively - compared to the overall data set, a good tactic might be to drop them. However, when we look at these observations, they still have value, as they include information about the school attended, year graduated, specialty, etc., so dropping them might not be the best idea. Plus, none of them are missing their professional enrollment ID. So what should we do?

Let's assign the string `Missing` to any observation that does not have a first or last name. This will allow us to keep the rest of the data for those observations while also flagging that we have incomplete information about exactly who the physician is.

In [20]:
# replace "" with "Missing" in both first and last name
data[['lst_nm', 'frst_nm']] = data[['lst_nm', 'frst_nm']].replace("", "Missing")
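
For reference, the two steps (fill `NaN` with `""`, then replace `""` with `"Missing"`) could also be done in one call. This is a small sketch on a toy frame, assuming the columns contain no genuinely empty strings; if they did, the two approaches would differ, because `fillna` only touches `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring one missing first name and one missing last name.
toy = pd.DataFrame({
    "lst_nm": ["MA", np.nan],
    "frst_nm": [np.nan, "BENJAMIN"],
})

# Two-step approach used in this notebook.
two_step = toy.fillna("").replace("", "Missing")

# One-step alternative: fill NaN directly with the placeholder.
one_step = toy.fillna("Missing")

same = two_step.equals(one_step)
print(same)
```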

In [21]:
# example of missing first name and how it was replaced (look at frst_nm)
data.iloc[80900]

npi                                     1033381983
ind_pac_id                              4183808488
ind_enrl_id                        I20110412000132
lst_nm                                          MA
frst_nm                                    Missing
mid_nm                                         NaN
suff                                           NaN
gndr                                             F
cred                                           NaN
med_sch                                      OTHER
grd_yr                                        1997
pri_spec                             OPHTHALMOLOGY
sec_spec_1                                     NaN
sec_spec_2                                     NaN
sec_spec_3                                     NaN
sec_spec_4                                     NaN
sec_spec_all                                   NaN
org_lgl_nm                                     NaN
org_pac_id                                     NaN
num_org_mem                    

In [22]:
# example of missing last name and how it was replaced (look at lst_nm)
data.iloc[136176]

npi                    1063454130
ind_pac_id             8628071131
ind_enrl_id       I20060822000324
lst_nm                    Missing
frst_nm                  BENJAMIN
mid_nm                     HANBOK
suff                          NaN
gndr                            M
cred                           OD
med_sch                     OTHER
grd_yr                       2001
pri_spec                OPTOMETRY
sec_spec_1                    NaN
sec_spec_2                    NaN
sec_spec_3                    NaN
sec_spec_4                    NaN
sec_spec_all                  NaN
org_lgl_nm                    NaN
org_pac_id                    NaN
num_org_mem                   NaN
adr_ln_1           115 W 25TH AVE
adr_ln_2                      NaN
ln_2_sprs                     NaN
cty                     SAN MATEO
st                             CA
zip                     944032259
phn_numbr              6.5035e+09
hosp_afl_1                    NaN
hosp_afl_lbn_1                NaN
hosp_afl_2    

The next name-related issue we have to address concerns the middle name and suffix.

In [23]:
data[['mid_nm', 'suff']].isna().mean()

mid_nm    0.245823
suff      0.981644
dtype: float64

Let's do what we did before and replace `NaN` in these columns with a blank `""`.

In [24]:
# fill nan with ""
data[['mid_nm', 'suff']] = data[['mid_nm', 'suff']].fillna(value="")

In [25]:
# check to see percentage of nan in middle name and suffix columns
data[['mid_nm', 'suff']].isna().mean()

mid_nm    0.0
suff      0.0
dtype: float64

In [26]:
# check percentage of nan in all name related columns
data[['lst_nm', 'frst_nm', 'mid_nm', 'suff']].isna().mean()

lst_nm     0.0
frst_nm    0.0
mid_nm     0.0
suff       0.0
dtype: float64

Ok cool, now we're going to combine all four of these columns into one to see if it helps us explain the difference between the number of unique IDs and the number of unique names.

In [28]:
# %%timeit
# data[['frst_nm', 'mid_nm', 'lst_nm', 'suff']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

In [29]:
%%timeit
data['frst_nm'].astype(str) + ' ' + data['mid_nm'].astype(str) + ' ' + data['lst_nm'].astype(str) + ' ' + data['suff'].astype(str)

1.63 s ± 308 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [31]:
# create column of full names
data['full_nm'] = data['frst_nm'].astype(str) + ' ' + data['mid_nm'].astype(str) + ' ' + data['lst_nm'].astype(str) + ' ' + data['suff'].astype(str)

In [33]:
# see the first 10 full names
data['full_nm'][:10]

0    ARDALAN  ENKESHAFI 
1    ARDALAN  ENKESHAFI 
2    ARDALAN  ENKESHAFI 
3    ARDALAN  ENKESHAFI 
4    ARDALAN  ENKESHAFI 
5       THOMAS L CIBULL 
6       THOMAS L CIBULL 
7       THOMAS L CIBULL 
8       THOMAS L CIBULL 
9       THOMAS L CIBULL 
Name: full_nm, dtype: object
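
Notice the doubled space and trailing space in the output above, which come from joining empty `mid_nm` and `suff` values. A minimal sketch of one way to tidy that up (not what this notebook actually runs) is to collapse repeated whitespace and strip the ends after concatenating. The two rows below reuse the names shown above; `full_nm` here is a standalone Series, not the notebook column.

```python
import pandas as pd

toy = pd.DataFrame({
    "frst_nm": ["ARDALAN", "THOMAS"],
    "mid_nm": ["", "L"],
    "lst_nm": ["ENKESHAFI", "CIBULL"],
    "suff": ["", ""],
})

full_nm = (
    toy["frst_nm"]
    .str.cat([toy["mid_nm"], toy["lst_nm"], toy["suff"]], sep=" ")
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    .str.strip()                           # drop leading/trailing spaces
)

print(full_nm.tolist())  # ['ARDALAN ENKESHAFI', 'THOMAS L CIBULL']
```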

Ok, so we've combined all the name information into one column. Let's see if it helps us identify why there are fewer names than IDs.

In [34]:
# return the total number of unique full names
len(data['full_nm'].unique())

1072060

Ok it looks like this helped increase the number of unique names! It is still less than the total number of unique IDs though as we can see below:

In [35]:
# see how many unique professional enrollment ID numbers there are
len(data['ind_enrl_id'].unique())

1147121

In [36]:
1147121 - 1072060

75061

However, the gap is not nearly as large as it was before, which is a good thing. We still need to keep digging, though, to determine why there is this discrepancy.

In [37]:
data[['full_nm', 'ind_enrl_id']].head()

Unnamed: 0,full_nm,ind_enrl_id
0,ARDALAN ENKESHAFI,I20130530000085
1,ARDALAN ENKESHAFI,I20130530000085
2,ARDALAN ENKESHAFI,I20150824000105
3,ARDALAN ENKESHAFI,I20150824000105
4,ARDALAN ENKESHAFI,I20150824000105


Luckily, we don't have to look too hard, as the first five rows show us that the same physician can have multiple IDs. I hypothesize that this has to do with the primary specialty and any other secondary specialties a physician may have.

In [39]:
data.head(10)

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn,full_nm
0,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y,ARDALAN ENKESHAFI
1,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y,ARDALAN ENKESHAFI
2,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y,ARDALAN ENKESHAFI
3,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y,ARDALAN ENKESHAFI
4,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y,ARDALAN ENKESHAFI
5,1003000134,4284706367,I20080707000385,CIBULL,THOMAS,L,,M,,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE,2003.0,PATHOLOGY,,,,,,NORTHSHORE UNIVERSITY HEALTHSYSTEM FACULTY PRA...,2163335000.0,1286.0,2100 PFINGSTEN RD,,,GLENVIEW,IL,600261301,,140010,NORTHSHORE UNIVERSITY HEALTHSYSTEM - EVANSTON ...,,,,,,,,,Y,THOMAS L CIBULL
6,1003000134,4284706367,I20080707000385,CIBULL,THOMAS,L,,M,,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE,2003.0,PATHOLOGY,,,,,,NORTHSHORE UNIVERSITY HEALTHSYSTEM FACULTY PRA...,2163335000.0,1286.0,2180 PFINGSTEN RD 1ST FLOOR,,Y,GLENVIEW,IL,600261301,,140010,NORTHSHORE UNIVERSITY HEALTHSYSTEM - EVANSTON ...,,,,,,,,,Y,THOMAS L CIBULL
7,1003000134,4284706367,I20080707000385,CIBULL,THOMAS,L,,M,,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE,2003.0,PATHOLOGY,,,,,,NORTHSHORE UNIVERSITY HEALTHSYSTEM FACULTY PRA...,2163335000.0,1286.0,777 PARK W AVE,,,HIGHLAND PARK,IL,600352433,8474328000.0,140010,NORTHSHORE UNIVERSITY HEALTHSYSTEM - EVANSTON ...,,,,,,,,,Y,THOMAS L CIBULL
8,1003000134,4284706367,I20080707000385,CIBULL,THOMAS,L,,M,,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE,2003.0,PATHOLOGY,,,,,,NORTHSHORE UNIVERSITY HEALTHSYSTEM FACULTY PRA...,2163335000.0,1286.0,777 PARK AVE W,,Y,HIGHALND PARK,IL,600352433,,140010,NORTHSHORE UNIVERSITY HEALTHSYSTEM - EVANSTON ...,,,,,,,,,Y,THOMAS L CIBULL
9,1003000134,4284706367,I20080707000385,CIBULL,THOMAS,L,,M,,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE,2003.0,PATHOLOGY,,,,,,NORTHSHORE UNIVERSITY HEALTHSYSTEM FACULTY PRA...,2163335000.0,1286.0,9600 GROSS POINT RD,,,SKOKIE,IL,600761214,,140010,NORTHSHORE UNIVERSITY HEALTHSYSTEM - EVANSTON ...,,,,,,,,,Y,THOMAS L CIBULL


The first three columns - `npi`, `ind_pac_id`, and `ind_enrl_id` - provide us information regarding that particular physician's ID assigned by NPPES, PECOS, and for individual professional enrollment (which is the source for the data in that particular observation). All three are different, and we can see that for the first two physicians listed - `ENKESHAFI ARDALAN` and `CIBULL THOMAS` - their `npi` and `ind_pac_id` are the same for all observations. However, when we look at `ENKESHAFI ARDALAN`, we can see that there are two unique `ind_enrl_id`s: `I20130530000085` & `I20150824000105`.
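
To see how common this is beyond the first few rows, one option is to count distinct enrollment IDs per `npi`. Below is a minimal sketch on a toy frame built from the ID values shown above (the row counts are trimmed for brevity); running the same `groupby` on the full `data` frame is left as an assumption about what would be informative.

```python
import pandas as pd

# NPIs and enrollment IDs taken from the first rows shown above.
toy = pd.DataFrame({
    "npi": [1003000126] * 5 + [1003000134] * 2,
    "ind_enrl_id": (
        ["I20130530000085"] * 2
        + ["I20150824000105"] * 3
        + ["I20080707000385"] * 2
    ),
})

# Number of distinct enrollment IDs behind each NPI.
enrl_per_npi = toy.groupby("npi")["ind_enrl_id"].nunique()

# NPIs that carry more than one enrollment ID.
multi = enrl_per_npi[enrl_per_npi > 1]

print(enrl_per_npi)
print(multi.index.tolist())  # [1003000126]
```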

I think I'm going to have to do a little bit more research about what exactly is going on with `ind_enrl_id`. Since we haven't really done any exploration on the other two IDs, let's see if either might serve as a better unique identifier.

In [40]:
# see how many unique observations are in NPI id column
len(data['npi'].unique())

1103330

In [41]:
# see how many unique observations are in PAC ID column
len(data['ind_pac_id'].unique())

1103330
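
Equal counts alone do not show that `npi` and `ind_pac_id` pair up one-to-one. A minimal sketch of how that could be checked, using a toy frame with the ID values from the rows shown earlier: keep the distinct pairs and confirm that neither column repeats.

```python
import pandas as pd

toy = pd.DataFrame({
    "npi": [1003000126, 1003000126, 1003000134],
    "ind_pac_id": [7517003643, 7517003643, 4284706367],
})

# Distinct (npi, ind_pac_id) combinations.
pairs = toy[["npi", "ind_pac_id"]].drop_duplicates()

# One-to-one means no npi and no ind_pac_id appears in more than one pair.
one_to_one = pairs["npi"].is_unique and pairs["ind_pac_id"].is_unique

print(len(pairs), one_to_one)
```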

As we can see above, both have the same number of unique IDs. However, there are still more unique values than there are unique names - _1,103,330_ versus _1,072,060_, respectively. This is something I'm very intrigued by but for now let's shelve it and start exploring other facets of the data. The next two columns that catch my eye are the `med_sch` and `grd_yr` columns (i.e. medical school physician attended and their graduation year). 

In [43]:
# lets see what data types we are dealing with, starting with med_sch
print('Data type for the med_sch column: {}'.format(data['med_sch'].dtype))
print('Total number of unique values in the med_sch column: {}'.format(len(data['med_sch'].unique())))

Data type for the med_sch column: object
Total number of unique values in the med_sch column: 401


So we're working with string values and `401` unique schools. As a proxy to best address duplicate entries (as we saw, the same physician can have multiple rows, thus inflating the number of instances of their school), we'll group by `npi` and then see the breakdown by school.

In [58]:
# ensure that grouping by NPI returns one row per unique NPI
len(data.groupby('npi')['med_sch'].first())

1103330
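
Taking `.first()` per `npi` quietly assumes each NPI lists a single school across its rows (and `GroupBy.first` returns the first non-null value in each group). Here is a small sketch of how that assumption could be checked, on a toy frame using the first two NPIs and schools from the data:

```python
import pandas as pd

toy = pd.DataFrame({
    "npi": [1003000126, 1003000126, 1003000134, 1003000134],
    "med_sch": [
        "OTHER",
        "OTHER",
        "UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE",
        "UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE",
    ],
})

# How many distinct schools does each NPI list?
schools_per_npi = toy.groupby("npi")["med_sch"].nunique()
consistent = bool((schools_per_npi <= 1).all())

# If consistent, taking the first school per NPI loses nothing.
school_by_npi = toy.groupby("npi")["med_sch"].first()

print(consistent)
print(school_by_npi)
```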

In [60]:
# lets see what it looks like as a pandas DataFrame
pd.DataFrame(data.groupby('npi')['med_sch'].first()).head()

Unnamed: 0_level_0,med_sch
npi,Unnamed: 1_level_1
1003000126,OTHER
1003000134,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE
1003000142,OTHER
1003000423,TOLEDO MEDICAL COLLEGE
1003000480,OHIO STATE UNIVERSITY COLLEGE OF MEDICINE


I like what I'm seeing! Let's go ahead and create a `pandas` DataFrame from this so that we can analyze it further.

In [61]:
# create school count dataframe
school_count = pd.DataFrame(data.groupby('npi')['med_sch'].first()); school_count.head()

Unnamed: 0_level_0,med_sch
npi,Unnamed: 1_level_1
1003000126,OTHER
1003000134,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE
1003000142,OTHER
1003000423,TOLEDO MEDICAL COLLEGE
1003000480,OHIO STATE UNIVERSITY COLLEGE OF MEDICINE


In [63]:
# reset the index to make npi its own column
school_count = school_count.reset_index(); school_count.head()

Unnamed: 0,npi,med_sch
0,1003000126,OTHER
1,1003000134,UNIVERSITY OF KENTUCKY COLLEGE OF MEDICINE
2,1003000142,OTHER
3,1003000423,TOLEDO MEDICAL COLLEGE
4,1003000480,OHIO STATE UNIVERSITY COLLEGE OF MEDICINE


In [65]:
# let's get the value counts now for the schools, displaying only the top 10
school_count['med_sch'].value_counts()[:10]

OTHER                                                       589730
INDIANA UNIVERSITY SCHOOL OF MEDICINE                         7375
WAYNE STATE UNIVERSITY SCHOOL OF MEDICINE                     7297
PALMER COLLEGE CHIROPRACTIC - DAVENPORT                       6849
PHILADELPHIA COLLEGE OF OSTEOPATHIC MEDICINE                  6236
UNIVERSITY OF MINNESOTA MEDICAL SCHOOL                        6201
TEMPLE UNIVERSITY SCHOOL OF MEDICINE                          5986
OHIO STATE UNIVERSITY COLLEGE OF MEDICINE                     5986
UNIVERSITY OF ILLINOIS AT CHICAGO HEALTH SCIENCE CENTER       5665
JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON UNIVERSITY      5415
Name: med_sch, dtype: int64

Looks like `OTHER` is by far the most common value, covering more than half of the unique NPIs (589,730 of 1,103,330)! To get a better idea of the distribution, let's use percentages in place of absolute counts.

In [69]:
# normalize counts so it returns a percentage
round((school_count['med_sch'].value_counts(normalize=True) * 100), 3)[:10]

OTHER                                                       53.450
INDIANA UNIVERSITY SCHOOL OF MEDICINE                        0.668
WAYNE STATE UNIVERSITY SCHOOL OF MEDICINE                    0.661
PALMER COLLEGE CHIROPRACTIC - DAVENPORT                      0.621
PHILADELPHIA COLLEGE OF OSTEOPATHIC MEDICINE                 0.565
UNIVERSITY OF MINNESOTA MEDICAL SCHOOL                       0.562
TEMPLE UNIVERSITY SCHOOL OF MEDICINE                         0.543
OHIO STATE UNIVERSITY COLLEGE OF MEDICINE                    0.543
UNIVERSITY OF ILLINOIS AT CHICAGO HEALTH SCIENCE CENTER      0.513
JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON UNIVERSITY     0.491
Name: med_sch, dtype: float64
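
Since `OTHER` takes up so much of the distribution, it can also help to set it aside and look at shares among the named schools only. This is a rough sketch on a toy Series; the school labels are placeholders, and whether excluding `OTHER` is appropriate for the real analysis is an assumption on my part.

```python
import pandas as pd

# Placeholder labels; only "OTHER" mirrors a real value in med_sch.
toy = pd.Series(
    ["OTHER", "OTHER", "OTHER", "SCHOOL A", "SCHOOL A", "SCHOOL B"],
    name="med_sch",
)

# Drop the catch-all category, then compute percentage shares.
named = toy[toy != "OTHER"]
share_named = named.value_counts(normalize=True).mul(100).round(3)

print(share_named)
```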

In [72]:
# let's look a little bit further down the list (and return to counts)
school_count['med_sch'].value_counts(normalize=False)[10:20]

NEW YORK MEDICAL COLLEGE                                     5390
UNIVERSITY OF MICHIGAN MEDICAL SCHOOL                        5338
GEORGETOWN UNIVERSITY OF MEDICINE                            5111
UNIVERSITY OF ALABAMA SCHOOL OF MEDICINE                     4976
UNIVERSITY OF KANSAS SCHOOL OF MEDICINE                      4896
MEDICAL COLLEGE OF GEORGIA                                   4882
UNIVERSITY OF TENNESSEE COLLEGE OF MEDICINE                  4823
UNIVERSITY OF CINCINNATI COLLEGE OF MEDICINE                 4771
UNIVERSITY OF TEXAS SOUTHWESTERN MEDICAL SCHOOL AT DALLAS    4708
UNIVERSITY OF TEXAS MEDICAL BRANCH AT GALVESTON              4690
Name: med_sch, dtype: int64

In [73]:
# a little bit further...
school_count['med_sch'].value_counts(normalize=False)[20:30]

NEW YORK UNIVERSITY SCHOOL OF MEDICINE                          4630
UNIVERSITY OF WASHINGTON SCHOOL OF MEDICINE                     4524
LOUISIANA STATE UNIVERSITY SCHOOL OF MEDICINE IN NEW ORLEANS    4483
MEDICAL COLLEGE OF WISCONSIN                                    4471
UNIVERSITY OF IOWA COLLEGE OF MEDICINE                          4410
BAYLOR COLLEGE OF MEDICINE                                      4344
UNIVERSITY OF WISCONSIN MEDICAL SCHOOL                          4262
BOSTON UNIVERSITY SCHOOL OF MEDICINE                            4203
NORTHWESTERN UNIVERSITY MEDICAL SCHOOL                          4168
TUFTS UNIVERSITY SCHOOL OF MEDICINE                             4152
Name: med_sch, dtype: int64
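
The three cells above recompute `value_counts` for every slice. One possible alternative is to compute the counts once and page through them by position. The sketch below uses a hypothetical helper, `rank_page`, and made-up labels; neither is part of the original notebook.

```python
import pandas as pd

def rank_page(counts: pd.Series, page: int, size: int = 10) -> pd.Series:
    """Return one page of an already-sorted value_counts result (page is 0-based)."""
    start = page * size
    return counts.iloc[start:start + size]

# toy data: made-up labels where 'S0' appears 25 times, 'S1' 24 times, and so on
toy = pd.Series([f'S{i}' for i in range(25) for _ in range(25 - i)])
counts = toy.value_counts()   # computed once, sorted by descending count

second_page = rank_page(counts, page=1)   # positions 10-19, like [10:20] above
print(second_page)
```

Using `.iloc` makes the positional intent explicit, whereas plain `[10:20]` relies on pandas treating integer slices as positional.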

In [75]:
# look at the last few rows; note that full_nm is currently the last column
data.tail()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn,full_nm
2210785,1992999825,143414284,I20101102000933,DESCHENES,GEOFFREY,R,,M,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2005.0,OTOLARYNGOLOGY,,,,,,VIRGINIA MASON MEDICAL CENTER,9830003000.0,776.0,19116 33RD AVE W,VIRGINIA MASON LYNNWOOD,,LYNNWOOD,WA,980364706,4257128000.0,500005,VIRGINIA MASON MEDICAL CENTER,,,,,,,,,Y,GEOFFREY R DESCHENES
2210786,1992999825,143414284,I20101102000933,DESCHENES,GEOFFREY,R,,M,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2005.0,OTOLARYNGOLOGY,,,,,,VIRGINIA MASON MEDICAL CENTER,9830003000.0,776.0,925 SENECA ST,VIRGINIA MASON HOSPITAL,,SEATTLE,WA,981012742,2062237000.0,500005,VIRGINIA MASON MEDICAL CENTER,,,,,,,,,Y,GEOFFREY R DESCHENES
2210787,1992999825,143414284,I20101102000933,DESCHENES,GEOFFREY,R,,M,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2005.0,OTOLARYNGOLOGY,,,,,,VIRGINIA MASON MEDICAL CENTER,9830003000.0,776.0,1100 9TH AVE,,,SEATTLE,WA,981012756,,500005,VIRGINIA MASON MEDICAL CENTER,,,,,,,,,Y,GEOFFREY R DESCHENES
2210788,1992999825,143414284,I20101102000933,DESCHENES,GEOFFREY,R,,M,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2005.0,OTOLARYNGOLOGY,,,,,,VIRGINIA MASON MEDICAL CENTER,9830003000.0,776.0,4575 SANDPOINT WAY NE,,Y,SEATTLE,WA,981053999,2065258000.0,500005,VIRGINIA MASON MEDICAL CENTER,,,,,,,,,Y,GEOFFREY R DESCHENES
2210789,1992999825,143414284,I20101102000933,DESCHENES,GEOFFREY,R,,M,,JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON ...,2005.0,OTOLARYNGOLOGY,,,,,,VIRGINIA MASON MEDICAL CENTER,9830003000.0,776.0,4575 SAND POINT WAY NE,108 VIRGINIA MASON SAND POINT,,SEATTLE,WA,981053999,2065258000.0,500005,VIRGINIA MASON MEDICAL CENTER,,,,,,,,,Y,GEOFFREY R DESCHENES


In [77]:
# slight modification to column order, adding full name after other name columns
new_order = ['npi', 'ind_pac_id', 'ind_enrl_id', 'lst_nm', 'frst_nm', 'mid_nm', 'suff', 'full_nm', 
             'gndr', 'cred', 'med_sch', 'grd_yr', 'pri_spec', 'sec_spec_1', 'sec_spec_2', 
             'sec_spec_3', 'sec_spec_4', 'sec_spec_all', 'org_lgl_nm', 'org_pac_id', 
             'num_org_mem', 'adr_ln_1', 'adr_ln_2', 'ln_2_sprs', 'cty', 'st', 'zip', 
             'phn_numbr', 'hosp_afl_1', 'hosp_afl_lbn_1', 'hosp_afl_2', 'hosp_afl_lbn_2', 
             'hosp_afl_3', 'hosp_afl_lbn_3', 'hosp_afl_4', 'hosp_afl_lbn_4', 'hosp_afl_5', 
             'hosp_afl_lbn_5', 'assgn']

# assign new column order to data
data = data[new_order]

In [78]:
# check it to make sure it worked
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,lst_nm,frst_nm,mid_nm,suff,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,ARDALAN ENKESHAFI,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ENKESHAFI,ARDALAN,,,ARDALAN ENKESHAFI,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,ARDALAN ENKESHAFI,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,ARDALAN ENKESHAFI,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ENKESHAFI,ARDALAN,,,ARDALAN ENKESHAFI,M,,OTHER,1994.0,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112,CJW MEDICAL CENTER,210028,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
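
Writing out the full column list works, but it has to be kept in sync by hand if columns are added or dropped later. A minimal sketch of one alternative, assuming only a single column needs to move, is to `pop` it and `insert` it next to its neighbor. The toy frame below borrows a few of the notebook's column names with made-up values.

```python
import pandas as pd

# toy frame with a few of the notebook's column names; values are made up
df = pd.DataFrame({
    'npi': [1], 'lst_nm': ['DOE'], 'frst_nm': ['JANE'],
    'suff': [None], 'gndr': ['F'], 'full_nm': ['JANE DOE'],
})

# remove 'full_nm', then re-insert it right after 'suff'
col = df.pop('full_nm')
df.insert(df.columns.get_loc('suff') + 1, 'full_nm', col)

print(list(df.columns))
```

Unlike `data = data[new_order]`, this modifies the frame in place rather than building a new one.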


In [79]:
# export current data to CSV for future analysis
data.to_csv('physician_compare_national-updates-1.csv', index=False)
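
When an exported file like this is read back later, pandas infers dtypes from the CSV text again, so a column that was stored as strings can come back as integers. The sketch below illustrates this on a made-up two-row frame, using an in-memory buffer instead of a real file. Which columns should be treated as text is an assumption here, not something the original notebook specifies.

```python
import io
import pandas as pd

# toy frame; the first zip value is made up to show the leading-zero issue
df = pd.DataFrame({'npi': [1003000126, 1992999825],
                   'zip': ['012345678', '981053999']})

buf = io.StringIO()
df.to_csv(buf, index=False)   # same call as above, but written to a buffer

buf.seek(0)
inferred = pd.read_csv(buf)                      # zip is inferred as integers
buf.seek(0)
as_text = pd.read_csv(buf, dtype={'zip': str})   # zip is kept as text

print(inferred['zip'].tolist())
print(as_text['zip'].tolist())
```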