# _Physician Compare National: Explore #5_

This notebook is a continuation of my analysis of the following data gathered via [Data.Medicare.gov](https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6). It contains general information about individual eligible professionals (EPs) such as demographic information and Medicare quality program participation. This dataset is updated twice a month with the most current demographic information available at that time.

# _Today's Goal_

My primary goal is to do an initial cleaning of the data set. There are still multiple columns I have not looked at, so the unique values within each column, and any missing values, need to be assessed and dealt with.

A secondary goal is to figure out a way to better format the zip codes column; further, I will look for a way to add information regarding county level data, whether that is through ZIP or FIPS codes. Time permitting, I could also begin initial trials of putting together a visual via `plot.ly`.

In [32]:
# first thing we need to do --> load in the data
# import pandas
import pandas as pd
pd.options.display.max_columns = None
%load_ext autoreload
%autoreload 2

# import data from yesterday
data = pd.read_csv('physician_compare_national-updates-1.csv', low_memory=False);

# inspect the first five rows
data.head()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


To ensure all the IDs and the `full_nm` column are good to go, let's check whether there are any missing values in the first four columns. We can also take a look at the total number of unique IDs and names in the data set.

In [33]:
# check for missing values
data[['npi', 'ind_pac_id', 'ind_enrl_id', 'full_nm']].isna().sum()

npi            0
ind_pac_id     0
ind_enrl_id    0
full_nm        0
dtype: int64

In [34]:
# no missing values! Now let's take a look at how many unique values we have in each column
for col in data[['npi', 'ind_pac_id', 'ind_enrl_id', 'full_nm']]:
    unique = len(data[col].unique())
    print('There are {} unique values in the {} column.'.format(unique, col))

There are 1103330 unique values in the npi column.
There are 1103330 unique values in the ind_pac_id column.
There are 1147121 unique values in the ind_enrl_id column.
There are 1072060 unique values in the full_nm column.


As we can see, there is a difference in the number of unique observations among the first four columns. This is important to look into further because we need a column to group observations by. In previous notebooks, we have seen that the same physician can have multiple rows; just take a look at the first five rows printed out above!

However, since there are no missing values in any of these columns, we can move on to cleaning the others.
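Before settling on a grouping column, it may help to check how the IDs and names relate to each other. The following is a minimal sketch on a made-up toy frame (the values are invented, only the column names come from the data set); it checks whether each `npi` maps to a single `ind_pac_id` and which names are shared by more than one `npi`.

```python
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    'npi':        [1001, 1001, 1002, 1003],
    'ind_pac_id': [501, 501, 502, 503],
    'full_nm':    ['ANN LEE', 'ANN LEE', 'ANN LEE', 'BOB RAY'],
})

# Does every npi map to exactly one ind_pac_id?
pac_per_npi = toy.groupby('npi')['ind_pac_id'].nunique()
each_npi_has_one_pac = bool((pac_per_npi == 1).all())

# Which names are shared by more than one npi?
npi_per_name = toy.groupby('full_nm')['npi'].nunique()
shared_names = npi_per_name[npi_per_name > 1]

print(each_npi_has_one_pac)
print(shared_names)
```

If names turn out to be shared across several `npi` values in the real data, that would explain why `full_nm` has fewer unique values than `npi`.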

In [35]:
# take a look at how many missing values there are in gender column
print('There are {} missing values in the gender column'.format(data['gndr'].isna().sum()))

There are 0 missing values in the gender column


In [36]:
# get the relative value counts in the gender column
round((data['gndr'].value_counts(normalize=True)) * 100, 3)

M    55.359
F    44.641
Name: gndr, dtype: float64

Our physician data set is approximately 55% male and ~45% female. With no missing values, let's move on to the `cred` column.

In [37]:
# take a look at the absolute number of missing values in credential column
data['cred'].isna().sum()

1524908

In [38]:
# what percentage of the cred column is missing values?
print('Approximately {}% of the cred column is missing.'.format(round((data['cred'].isna().mean()) * 100, 3)))

Approximately 68.976% of the cred column is missing.


In [39]:
# how many unique values are there in the cred column?
print('There are {} unique values in the credential column.'.format(len(data['cred'].unique())))

There are 22 unique values in the credential column.


In [40]:
# what are the unique values in the credential column?
for cred in data['cred'].unique():
    print(cred)

nan
PA
PT
CNS
MD
CSW
CNA
DC
NP
OD
CP
DO
CNM
AU
OT
MNT
DPM
AA
DDS
PSY
DDM
SCW


Above are all the unique values in the credential column. The first one is `nan`, which stands for '_not a number_' and is how pandas marks a missing value; we'll address it in a little bit. There are two more issues: first, we need to do some research about what each of these abbreviations means; second, there appears to be leading whitespace in some of the values.
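The leading whitespace mentioned above could be handled with pandas' `.str.strip()`. This is a small sketch on an invented series, not the real `cred` column; it shows that stripping can collapse values that only differ by surrounding spaces while leaving missing values missing.

```python
import pandas as pd

# Invented credential values; ' MD' and 'PA ' carry stray whitespace.
cred = pd.Series([' MD', 'MD', 'PA ', None, 'DO'])

# .str.strip() removes leading/trailing whitespace and keeps missing values missing.
cleaned = cred.str.strip()

print(cred.nunique(), cleaned.nunique())
print(cleaned.isna().sum())
```

On the real column, comparing `nunique()` before and after stripping would show whether any of the 22 unique values are whitespace variants of each other.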

Before we start our research, let's go ahead and replace `NaN`. But how do we want to approach this? We don't want to replace it with `None`, since being a medical professional virtually always requires some type of certification to ensure that the individual has the requisite skills to care for patients. However, it is not feasible to research each and every physician to determine what school they graduated from.

However, before we go any further, let's check out the `grd_yr` column to see if there are any missing values. Essentially, `grd_yr` can serve as a proxy for `cred` in that we can first confirm that an individual graduated from a program in the first place. If there are no missing values, then everyone graduated from somewhere; we just don't know where that somewhere is.

In [44]:
# check grd_yr for missing values
print('There are {} missing values in the graduation year column.'.format(data['grd_yr'].isna().sum()))

There are 0 missing values in the graduation year column.


In [48]:
# print out the unique graduation years
for year in sorted(data['grd_yr'].unique()):
    print(year)

0
1944
1945
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020


A-ha! Despite having no missing values it looks like there are values of `0` in the column as indicated by a print out of all the unique values! Now I remember that a few notebooks back I looked into the `grd_yr` column and reassigned the missing values to `0`. This was done in an attempt to preserve any other information that might be useful in that particular observation. 

Let's take a quick look and see how many observations have a value of `0` for `grd_yr`.

In [54]:
# how many observations have a value of 0 in grd_yr?
zero = len(data[data['grd_yr'] == 0])
print('There are {} observations in the grd_yr column that have a value of 0.'.format(zero))

There are 5199 observations in the grd_yr column that have a value of 0.


In [57]:
# as a reminder, how many observations are missing in the cred column?
print('There are {} missing values in the credential column.'.format(data['cred'].isna().sum()))

There are 1524908 missing values in the credential column.


As we can see, there is a disconnect between the graduation year and credential data. The vast majority of observations have a graduation year, with only _~0.24%_ missing (i.e., recorded as `0`). In the `cred` column, by comparison, missing values make up _~68.98%_ of observations.
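The two percentages above can be computed the same way for both columns, as long as the `0` sentinel in `grd_yr` is treated as missing. A minimal sketch on invented values (the toy frame below is an assumption, not the real data):

```python
import pandas as pd

# Invented rows: one sentinel 0 in grd_yr, three missing credentials.
toy = pd.DataFrame({
    'grd_yr': [1994, 0, 2001, 2010],
    'cred':   [None, 'MD', None, None],
})

# Share of rows where the graduation year is the 0 placeholder.
pct_zero_year = (toy['grd_yr'] == 0).mean() * 100

# Share of rows where the credential is missing.
pct_missing_cred = toy['cred'].isna().mean() * 100

# Optionally turn the sentinel back into a real missing value.
grd_yr_clean = toy['grd_yr'].mask(toy['grd_yr'] == 0)

print(pct_zero_year, pct_missing_cred)
```

Masking the sentinel makes `isna()` report the same thing for both columns, at the cost of `grd_yr` becoming a float column.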

What is this telling us? That virtually all the physicians graduated from _some institution_, but we aren't sure exactly which institution that is. Now, what can we do? It is infeasible to go through every observation, look up the physician's name, and then determine what credential they graduated with. While that would provide a complete picture of the data, the time one would have to invest is significant, so that option is out the window.

What can we do? Something similar to what we did in the `grd_yr` column when we labeled the missing values as `0`. In this case, though, we can substitute something like `Not Listed`, seeing as the column holds strings. That way we can keep the other information for that physician; plus, if we take a look at the `pri_spec` column, which gives us the primary specialty of that physician, we can perhaps get a rough inference of what credential that physician may hold.
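The rough inference from `pri_spec` mentioned above could look something like the sketch below. It is only an illustration on invented rows: for each specialty it takes the most common listed credential and uses it as a guess where `cred` is missing. The `cred_guess` column name is hypothetical, and a guess like this is not a verified credential, so it is kept separate from `cred`.

```python
import pandas as pd

# Invented rows; only the column names come from the data set.
toy = pd.DataFrame({
    'pri_spec': ['PHYSICIAN ASSISTANT', 'PHYSICIAN ASSISTANT', 'PHYSICIAN ASSISTANT',
                 'CHIROPRACTIC', 'CHIROPRACTIC'],
    'cred':     ['PA', 'PA', None, 'DC', None],
})

# Most common listed credential per primary specialty.
known = toy.dropna(subset=['cred'])
likely_cred = known.groupby('pri_spec')['cred'].agg(lambda s: s.mode().iloc[0])

# Hypothetical helper column: keep listed credentials, fill gaps with the guess.
toy['cred_guess'] = toy['cred'].fillna(toy['pri_spec'].map(likely_cred))

print(toy)
```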

In [64]:
# example of how to fill na's 
data['cred'].fillna('Not Listed')[:5]

0    Not Listed
1    Not Listed
2    Not Listed
3    Not Listed
4    Not Listed
Name: cred, dtype: object

In [65]:
# lets go ahead and apply the above to the credential column
data['cred'] = data['cred'].fillna('Not Listed'); data['cred'][:5]

0    Not Listed
1    Not Listed
2    Not Listed
3    Not Listed
4    Not Listed
Name: cred, dtype: object

In [66]:
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


Columns `npi` through `grd_yr` have been addressed; now, let us take a look at the `pri_spec` column.

In [69]:
# how many missing values are there in the pri_spec column?
print('There are {} missing values in the primary specialty column.'.format(data['pri_spec'].isna().sum()))

There are 0 missing values in the primary specialty column.


In [72]:
# how many unique values are there in the pri_spec column?
print('Number of unique values in primary specialty column: {}'.format(len(data['pri_spec'].unique())))

Number of unique values in primary specialty column: 83


With 83 unique values, we have quite a few specialties listed. Let's print out the unique values and then get the value counts for each.

In [76]:
# print out the unique specialties in the pri_spec column, sorted
for specialty in sorted(data['pri_spec'].unique()):
    print(specialty)

ADDICTION MEDICINE
ADVANCED HEART FAILURE AND TRANSPLANT CARDIOLOGY
ALLERGY/IMMUNOLOGY
ANESTHESIOLOGY
ANESTHESIOLOGY ASSISTANT
CARDIAC ELECTROPHYSIOLOGY
CARDIAC SURGERY
CARDIOVASCULAR DISEASE (CARDIOLOGY)
CERTIFIED CLINICAL NURSE SPECIALIST (CNS)
CERTIFIED NURSE MIDWIFE (CNM)
CERTIFIED REGISTERED NURSE ANESTHETIST (CRNA)
CHIROPRACTIC
CLINICAL SOCIAL WORKER
COLORECTAL SURGERY (PROCTOLOGY)
CRITICAL CARE (INTENSIVISTS)
DENTIST
DERMATOLOGY
DIAGNOSTIC RADIOLOGY
EMERGENCY MEDICINE
ENDOCRINOLOGY
FAMILY MEDICINE
GASTROENTEROLOGY
GENERAL PRACTICE
GENERAL SURGERY
GERIATRIC MEDICINE
GERIATRIC PSYCHIATRY
GYNECOLOGICAL ONCOLOGY
HAND SURGERY
HEMATOLOGY
HEMATOLOGY/ONCOLOGY
HEMATOPOIETIC CELL TRANSPLANTATION AND CELLULAR TH
HOSPICE/PALLIATIVE CARE
HOSPITALIST
INFECTIOUS DISEASE
INTERNAL MEDICINE
INTERVENTIONAL CARDIOLOGY
INTERVENTIONAL PAIN MANAGEMENT
INTERVENTIONAL RADIOLOGY
MAXILLOFACIAL SURGERY
MEDICAL GENETICS AND GENOMICS
MEDICAL ONCOLOGY
MEDICAL TOXICOLOGY
NEPHROLOGY
NEUROLOGY
NEUROPSYCHIATRY
NE

Nothing too out of the ordinary with the specialties listed above. However, there is something to perhaps explore a little further. Near the end there are two values - `UNDEFINED NON-PHYSICIAN TYPE (SPECIFY)` & `UNDEFINED PHYSICIAN TYPE (SPECIFY)` that are essentially flagging unknown values. Let's take a quick detour and explore these specific values before going any further. 

In [77]:
# how many values in pri_spec column are of the undefined non-physician type?
len(data[data['pri_spec'] == 'UNDEFINED NON-PHYSICIAN TYPE (SPECIFY)'])

18

In [78]:
# how many values in pri_spec column are of the undefined physician type?
len(data[data['pri_spec'] == 'UNDEFINED PHYSICIAN TYPE (SPECIFY)'])

612

It looks like there are very few of these observations compared to the overall data set (roughly 2.2M rows). However, let's keep this in mind going forward, as we might be able to find more information in another column, perhaps one of the secondary specialty columns.
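One way to follow up on that idea is to pull out the undefined rows and see whether any of them list a secondary specialty. A minimal sketch on invented rows (the two `UNDEFINED ...` labels are the ones from the data; everything else is made up):

```python
import pandas as pd

# Invented rows using the two undefined labels seen in pri_spec.
toy = pd.DataFrame({
    'pri_spec':   ['UNDEFINED PHYSICIAN TYPE (SPECIFY)',
                   'UNDEFINED NON-PHYSICIAN TYPE (SPECIFY)',
                   'INTERNAL MEDICINE'],
    'sec_spec_1': ['SLEEP MEDICINE', None, None],
})

# Rows whose primary specialty is one of the undefined placeholders.
undefined = toy[toy['pri_spec'].str.startswith('UNDEFINED')]

# Of those, rows that still carry a secondary specialty.
with_secondary = undefined[undefined['sec_spec_1'].notna()]

print(len(undefined), len(with_secondary))
```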

Next, we'll get back to the higher level overview of the `pri_spec` column by getting the value counts for each unique value in the column.

In [79]:
# return absolute value counts for top 20 values in primary specialty column
data['pri_spec'].value_counts()[:20]

NURSE PRACTITIONER                               235345
PHYSICIAN ASSISTANT                              203371
INTERNAL MEDICINE                                182186
FAMILY MEDICINE                                  158359
DIAGNOSTIC RADIOLOGY                             156303
PHYSICAL THERAPY                                  84096
CERTIFIED REGISTERED NURSE ANESTHETIST (CRNA)     79246
ANESTHESIOLOGY                                    75169
CARDIOVASCULAR DISEASE (CARDIOLOGY)               66414
OBSTETRICS/GYNECOLOGY                             59125
CLINICAL SOCIAL WORKER                            49860
ORTHOPEDIC SURGERY                                45332
OPTOMETRY                                         44946
GENERAL SURGERY                                   44396
CHIROPRACTIC                                      42595
PSYCHIATRY                                        39718
OPHTHALMOLOGY                                     34855
NEUROLOGY                                       

Quickly, what is wrong with the above?

YOU FORGOT THAT THERE ARE DUPLICATE ENTRIES, i.e. there are physicians who are listed multiple times and will thus inflate the numbers for whatever their `pri_spec` is!

To get a more accurate picture, we first need to group by physician; then we can return a more accurate value count.

In [87]:
# group by physician (condensing down duplicate physicians) and grab their associated pri_spec
data.groupby('full_nm')['pri_spec'].first()[:10]

full_nm
A  MIRANDA                       MEDICAL ONCOLOGY
A  NARASIMHA RAO                        NEUROLOGY
A ALAN SEMINE                DIAGNOSTIC RADIOLOGY
A ANANTH  RAMAN                 PULMONARY DISEASE
A ANDREW RUDMANN                INTERNAL MEDICINE
A B  M MASUDUR  RAHMAN            FAMILY MEDICINE
A BENEDICT  COSIMI              SURGICAL ONCOLOGY
A BETHEL  GILBERT          CLINICAL SOCIAL WORKER
A BRANT LIPSCOMB JR.           ORTHOPEDIC SURGERY
A C BADDER                       PHYSICAL THERAPY
Name: pri_spec, dtype: object

In [89]:
# make a dataframe of the above and reset the index
pri_spec_df = pd.DataFrame(data.groupby('full_nm')['pri_spec'].first()).reset_index(); pri_spec_df.head()

Unnamed: 0,full_nm,pri_spec
0,A MIRANDA,MEDICAL ONCOLOGY
1,A NARASIMHA RAO,NEUROLOGY
2,A ALAN SEMINE,DIAGNOSTIC RADIOLOGY
3,A ANANTH RAMAN,PULMONARY DISEASE
4,A ANDREW RUDMANN,INTERNAL MEDICINE


In [90]:
# adjust column names
pri_spec_df.columns = ['name', 'primary_specialty']

In [91]:
# now return value count in primary specialty column, similar to how we did above
pri_spec_df['primary_specialty'].value_counts()[:10]

NURSE PRACTITIONER                               130090
INTERNAL MEDICINE                                 89304
FAMILY MEDICINE                                   82162
PHYSICIAN ASSISTANT                               73635
PHYSICAL THERAPY                                  57298
CERTIFIED REGISTERED NURSE ANESTHETIST (CRNA)     41544
CLINICAL SOCIAL WORKER                            39863
CHIROPRACTIC                                      39173
ANESTHESIOLOGY                                    37146
OPTOMETRY                                         31212
Name: primary_specialty, dtype: int64

As we can see, condensing the duplicate entries down to one per physician cuts many of the earlier counts nearly in half! This is the more accurate count of the primary specialty for each person in the data set. We'll take note of this for later use. However, since this column (i.e., `pri_spec` in the original `data` DataFrame) has no missing values, we can move on to the columns associated with secondary specialties.
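One caveat: `full_nm` has fewer unique values than `npi`, so grouping by name may merge different people who share a name. A possible alternative, assuming `npi` identifies one clinician, is to deduplicate on `npi` instead. The sketch below uses invented rows to show how the two approaches can differ.

```python
import pandas as pd

# Invented rows: two different clinicians (npi 1 and 2) share the name ANN LEE.
toy = pd.DataFrame({
    'npi':      [1, 1, 2, 3],
    'full_nm':  ['ANN LEE', 'ANN LEE', 'ANN LEE', 'BOB RAY'],
    'pri_spec': ['NEUROLOGY', 'NEUROLOGY', 'DERMATOLOGY', 'NEUROLOGY'],
})

# Grouping by name collapses both ANN LEEs into one row.
by_name = toy.groupby('full_nm')['pri_spec'].first().value_counts()

# Keeping one row per npi preserves both clinicians.
by_npi = toy.drop_duplicates(subset='npi')['pri_spec'].value_counts()

print(by_name)
print(by_npi)
```

In the toy frame the name-based count loses the dermatologist entirely, which is the kind of undercount to watch for in the real data.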

In [92]:
# check the missing values in the columns associated with second specialties
data[['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4', 'sec_spec_all']].isna().sum()

sec_spec_1      1891912
sec_spec_2      2175539
sec_spec_3      2207106
sec_spec_4      2210199
sec_spec_all    1891912
dtype: int64

In [94]:
# same as above but relayed as a percentage of the column
round((data[['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4', 'sec_spec_all']].isna().mean()) * 100, 3)

sec_spec_1      85.576
sec_spec_2      98.406
sec_spec_3      99.833
sec_spec_4      99.973
sec_spec_all    85.576
dtype: float64

It appears that a very large percentage of all these columns don't have any values. Upon a second look though, it appears that we might have a column - `sec_spec_all` - that combines all the secondary specialties into one. Let's check it out further to either confirm or refute this.

In [96]:
# loop through all the secondary specialty columns to see how many unique values they have
for col in data[['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4', 'sec_spec_all']]:
    unique = len(data[col].unique())
    print('Number of unique values in {}: {}'.format(col, unique))

Number of unique values in sec_spec_1: 69
Number of unique values in sec_spec_2: 64
Number of unique values in sec_spec_3: 56
Number of unique values in sec_spec_4: 43
Number of unique values in sec_spec_all: 1556


This is further evidence that the `sec_spec_all` column contains the combinations of each physician's secondary specialties. Let's dive further into each column, as it might give us some clues about which secondary specialties are most common.
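To go beyond counting unique values, the hunch could be tested directly by rebuilding the combined column from the numbered ones and comparing. This sketch assumes the values are joined in column order with `', '` as the separator, which is a guess based on entries such as `CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE`; the rows are invented.

```python
import pandas as pd

sec_cols = ['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4']

# Invented rows shaped like the real columns.
toy = pd.DataFrame({
    'sec_spec_1':   ['CRITICAL CARE (INTENSIVISTS)', 'SLEEP MEDICINE', None],
    'sec_spec_2':   ['INTERNAL MEDICINE', None, None],
    'sec_spec_3':   [None, None, None],
    'sec_spec_4':   [None, None, None],
    'sec_spec_all': ['CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE',
                     'SLEEP MEDICINE', None],
})

# Join the non-missing numbered specialties; rows with none become ''.
rebuilt = toy[sec_cols].apply(lambda row: ', '.join(row.dropna()), axis=1)

# Compare against sec_spec_all, treating missing as ''.
matches = (rebuilt == toy['sec_spec_all'].fillna('')).all()

print(matches)
```

A row-wise `apply` is slow on millions of rows, so on the real data it would be more practical to run this on a deduplicated frame. If the check fails, the order or separator assumption is the first thing to revisit.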

In [101]:
# remember: we have to reformat the dataframe so that it combines all the physicians and their duplicates
# into one observation
specialty_df = pd.DataFrame(data.groupby('full_nm')[['pri_spec', 
                                                     'sec_spec_1', 
                                                     'sec_spec_2', 
                                                     'sec_spec_3', 
                                                     'sec_spec_4', 
                                                     'sec_spec_all']].first()).reset_index()

In [102]:
# check out the first few lines of the dataframe we just created
specialty_df.head()

Unnamed: 0,full_nm,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all
0,A MIRANDA,MEDICAL ONCOLOGY,,,,,
1,A NARASIMHA RAO,NEUROLOGY,,,,,
2,A ALAN SEMINE,DIAGNOSTIC RADIOLOGY,,,,,
3,A ANANTH RAMAN,PULMONARY DISEASE,SLEEP MEDICINE,,,,SLEEP MEDICINE
4,A ANDREW RUDMANN,INTERNAL MEDICINE,,,,,


In [106]:
# loop through each numbered secondary specialty column and retrieve top 10 value counts
for col in specialty_df[['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4']]:
    print('Top 10 Value Counts of {} column'.format(col))
    print(' ')
    print(specialty_df[col].value_counts()[:10])
    print('-' * 30)

Top 10 Value Counts of sec_spec_1 column
 
INTERNAL MEDICINE                      46297
CRITICAL CARE (INTENSIVISTS)            8450
PEDIATRIC MEDICINE                      6556
CARDIOVASCULAR DISEASE (CARDIOLOGY)     6494
FAMILY MEDICINE                         4185
GENERAL SURGERY                         4163
GERIATRIC MEDICINE                      3350
EMERGENCY MEDICINE                      3306
PAIN MANAGEMENT                         3011
SPORTS MEDICINE                         2630
Name: sec_spec_1, dtype: int64
------------------------------
Top 10 Value Counts of sec_spec_2 column
 
INTERNAL MEDICINE          5251
PULMONARY DISEASE          1189
PAIN MANAGEMENT             710
PEDIATRIC MEDICINE          617
MEDICAL ONCOLOGY            458
SLEEP MEDICINE              446
VASCULAR SURGERY            437
NUCLEAR MEDICINE            310
HOSPICE/PALLIATIVE CARE     273
GENERAL PRACTICE            252
Name: sec_spec_2, dtype: int64
------------------------------
Top 10 Value Counts 

A quick look shows that `INTERNAL MEDICINE` is in the top 10 of all of the secondary specialty columns. Let's see what's in store for us in the `sec_spec_all` column.

In [110]:
# get top 20 value counts (normalized to show percentage) of the sec_spec_all column
(specialty_df['sec_spec_all'].value_counts(normalize=True) * 100)[:20]

INTERNAL MEDICINE                                  33.369510
PEDIATRIC MEDICINE                                  4.895843
CRITICAL CARE (INTENSIVISTS)                        3.956769
CARDIOVASCULAR DISEASE (CARDIOLOGY)                 3.722378
GENERAL SURGERY                                     2.898616
FAMILY MEDICINE                                     2.811190
PAIN MANAGEMENT                                     2.187151
GERIATRIC MEDICINE                                  2.153236
EMERGENCY MEDICINE                                  2.147207
SPORTS MEDICINE                                     1.980646
HOSPITALIST                                         1.707817
ANESTHESIOLOGY                                      1.493775
CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE     1.407103
PULMONARY DISEASE                                   1.381478
INTERVENTIONAL RADIOLOGY                            1.352838
ORTHOPEDIC SURGERY                                  1.122969
DIAGNOSTIC RADIOLOGY    

For the most part, it appears that among the physicians who do have a secondary specialty, a large portion have just one: `INTERNAL MEDICINE`. Another thing to note is that only one of the entries shown, `CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE`, lists more than one secondary specialty. This suggests that most physicians don't have more than one specialty in addition to their primary one.
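The claim about how many secondary specialties physicians tend to have can be checked by counting the non-missing numbered columns per row. A small sketch on invented rows; it has to run while the gaps are still `NaN`, i.e. before any placeholder is filled in.

```python
import pandas as pd

sec_cols = ['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4']

# Invented rows: zero, one, two, and zero secondary specialties.
toy = pd.DataFrame({
    'sec_spec_1': [None, 'SLEEP MEDICINE', 'CRITICAL CARE (INTENSIVISTS)', None],
    'sec_spec_2': [None, None, 'INTERNAL MEDICINE', None],
    'sec_spec_3': [None, None, None, None],
    'sec_spec_4': [None, None, None, None],
})

# Number of listed secondary specialties per row.
n_secondary = toy[sec_cols].notna().sum(axis=1)

# How many rows have 0, 1, 2, ... secondary specialties.
distribution = n_secondary.value_counts().sort_index()

print(distribution)
```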

But let's get back to what we were originally seeking to do: fill in the `NaN`s in the secondary specialty columns. For these columns we can go ahead and replace `NaN` with the string `'None'`, as a value of `NaN` most likely means that the physician doesn't have a secondary specialty.
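One thing to keep in mind with a string placeholder: once the gaps are filled, `isna()` no longer sees them, so any later count of "missing" secondary specialties has to look for the placeholder instead. A minimal sketch on invented rows:

```python
import pandas as pd

sec_cols = ['sec_spec_1', 'sec_spec_2']

# Invented rows with three gaps in total.
toy = pd.DataFrame({
    'sec_spec_1': ['SLEEP MEDICINE', None],
    'sec_spec_2': [None, None],
})

# Fill the gaps with the string 'None' (a placeholder, not Python's None).
filled = toy[sec_cols].fillna('None')

# isna() reports nothing missing any more...
missing_after = int(filled.isna().sum().sum())

# ...so the gaps have to be counted via the placeholder itself.
placeholder_count = int((filled == 'None').sum().sum())

print(missing_after, placeholder_count)
```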

In [119]:
# replace NaN in all the secondary specialty columns with the string 'None'
sec_cols = ['sec_spec_1', 'sec_spec_2', 'sec_spec_3', 'sec_spec_4', 'sec_spec_all']
data[sec_cols] = data[sec_cols].fillna('None')

Now that we've cleaned up more than a few columns, let's take a look at the first few rows of `data` and see how it is shaping up.

In [120]:
data.head()

Unnamed: 0,npi,ind_pac_id,ind_enrl_id,full_nm,gndr,cred,med_sch,grd_yr,pri_spec,sec_spec_1,sec_spec_2,sec_spec_3,sec_spec_4,sec_spec_all,org_lgl_nm,org_pac_id,num_org_mem,adr_ln_1,adr_ln_2,ln_2_sprs,cty,st,zip,phn_numbr,hosp_afl_1,hosp_afl_lbn_1,hosp_afl_2,hosp_afl_lbn_2,hosp_afl_3,hosp_afl_lbn_3,hosp_afl_4,hosp_afl_lbn_4,hosp_afl_5,hosp_afl_lbn_5,assgn
0,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1850 TOWN CTR PKWY,,,RESTON,VA,201903219,7036899000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
1,1003000126,7517003643,I20130530000085,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,1701 N GEORGE MASON DR,,,ARLINGTON,VA,222053610,7035586000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
2,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,EMERGENCY MEDICINE ASSOCIATES PA PC,8022915000.0,182.0,24440 STONE SPRINGS BLVD,,,DULLES,VA,201662247,5713674000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
3,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,1401 JOHNSTON WILLIS DR,,,NORTH CHESTERFIELD,VA,232354730,8044835000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y
4,1003000126,7517003643,I20150824000105,ARDALAN ENKESHAFI,M,Not Listed,OTHER,1994,INTERNAL MEDICINE,,,,,,SOUTHEASTERN INTENSIVIST SERVICES PC,9335152000.0,133.0,411 W RANDOLPH RD,,,HOPEWELL,VA,238602938,8045412000.0,490112.0,CJW MEDICAL CENTER,210028.0,MEDSTAR SAINT MARY'S HOSPITAL,,,,,,,Y


In [123]:
# check the percentage of missing values in the columns we've cleaned so far
data.iloc[:, :14].isna().mean() * 100

npi             0.0
ind_pac_id      0.0
ind_enrl_id     0.0
full_nm         0.0
gndr            0.0
cred            0.0
med_sch         0.0
grd_yr          0.0
pri_spec        0.0
sec_spec_1      0.0
sec_spec_2      0.0
sec_spec_3      0.0
sec_spec_4      0.0
sec_spec_all    0.0
dtype: float64

In [124]:
# save updates to CSV so that tomorrow we can pick up where we left off
data.to_csv('physician_compare_national-updates-1.csv', index=False)

# _Work Completed_

Due to time constraints, I'm going to cut off this notebook here. Above we saved our updates into a CSV so that tomorrow we can just read it in via `pandas` and start right back up where we left off. Below is an overview of the steps that still need to be completed:
1. Examine all columns after, and including, `org_lgl_nm`
2. Determine if any cleaning is needed on each respective column
3. If cleaning is necessary, determine how best to address missing/unusual values
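A note for picking this back up: the pandas documentation lists the text `None` among the strings `read_csv` can treat as missing by default, so the `'None'` placeholders may come back as `NaN` when the CSV is reloaded (worth verifying on the installed version). The sketch below, on invented rows and an in-memory buffer rather than the real file, shows one way to keep the placeholder as text while still reading empty cells as missing.

```python
import io
import pandas as pd

# Invented rows: one 'None' placeholder and one genuinely missing address line.
toy = pd.DataFrame({
    'sec_spec_1': ['SLEEP MEDICINE', 'None'],
    'adr_ln_2':   ['SUITE 100', None],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Only treat empty cells as missing, so the 'None' text survives the round trip.
restored = pd.read_csv(buffer, keep_default_na=False, na_values=[''])

print(restored)
```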