### Goal 3, Question 1:
For entity type __Individual__, what is the "normal basket" of __procedures__ for each **provider type**?

#### Steps:
    1. Read in the CSV and select rows where ['Entity Type of the Provider'] == 'I'
    2. Select only the necessary columns, for clarity
    3. Create a query that returns the top 15 procedures for a given provider type

In [1]:
import pandas as pd
import pickle

#### Read the CSV, keeping only rows with Entity Type 'I'

In [2]:
%%time

individual_providers_rows = []
# Read the file in chunks and keep only the rows for individual ('I') providers
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv',
                         chunksize=1000):
    individual_providers_rows.append(chunk[chunk['Entity Type of the Provider'] == 'I'])

# Combine the filtered chunks into a single DataFrame
individual_type_providers = pd.concat(individual_providers_rows, ignore_index=True)

Wall time: 2min 7s
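
The chunked read can also drop unneeded columns while the file is being parsed, which keeps less data in memory. The following is a minimal sketch under stated assumptions: the in-memory CSV, its values, the `KEEP_COLUMNS` list, and the `read_individual_rows` helper are all made up for illustration, and the chunk size is arbitrary.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CMS file; every value below is made up.
SAMPLE_CSV = (
    "National Provider Identifier,Entity Type of the Provider,"
    "Provider Type,HCPCS Code,Number of Services,Unused Column\n"
    "1,I,Internal Medicine,99217,12,x\n"
    "2,O,Clinical Laboratory,80053,40,y\n"
    "3,I,Diagnostic Radiology,71046,25,z\n"
)

# Assumed subset of columns to keep; "Unused Column" is dropped at read time.
KEEP_COLUMNS = [
    "National Provider Identifier",
    "Entity Type of the Provider",
    "Provider Type",
    "HCPCS Code",
    "Number of Services",
]

def read_individual_rows(source, chunksize=100_000):
    """Read a CSV in chunks and keep only rows with entity type 'I'."""
    kept = []
    reader = pd.read_csv(
        source,
        usecols=KEEP_COLUMNS,
        dtype={"HCPCS Code": str},  # keep codes as text in every chunk
        chunksize=chunksize,
    )
    for chunk in reader:
        kept.append(chunk[chunk["Entity Type of the Provider"] == "I"])
    return pd.concat(kept, ignore_index=True)

individuals = read_individual_rows(io.StringIO(SAMPLE_CSV), chunksize=2)
print(individuals.shape)
```

Reading `HCPCS Code` as a string avoids chunks in which all codes happen to be numeric being parsed as integers while other chunks are parsed as text.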


#### select columns:
    - National Provider Identifier
    - Entity Type of the Provider
    - Provider Type
    - Place of Service
    - HCPCS Code
    - HCPCS Description
    - Number of Services
    - Number of Medicare Beneficiaries
    - Number of Distinct Medicare Beneficiary/Per Day Services

In [3]:
individual_type_providers_reduced_columns = individual_type_providers[['National Provider Identifier','Entity Type of the Provider',
                                                             'Provider Type','Place of Service','HCPCS Code',
                                                             'HCPCS Description','Number of Services',
                                                             'Number of Medicare Beneficiaries',
                                                             'Number of Distinct Medicare Beneficiary/Per Day Services']]

In [4]:
individual_type_providers_reduced_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9416125 entries, 0 to 9416124
Data columns (total 9 columns):
 #   Column                                                    Dtype  
---  ------                                                    -----  
 0   National Provider Identifier                              int64  
 1   Entity Type of the Provider                               object 
 2   Provider Type                                             object 
 3   Place of Service                                          object 
 4   HCPCS Code                                                object 
 5   HCPCS Description                                         object 
 6   Number of Services                                        float64
 7   Number of Medicare Beneficiaries                          int64  
 8   Number of Distinct Medicare Beneficiary/Per Day Services  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 646.6+ MB


#### Diagnostic Radiology has the most rows, so let's start by isolating that provider type...

In [5]:
individual_type_providers_reduced_columns['Provider Type'].value_counts()

Diagnostic Radiology                   1241400
Internal Medicine                      1118171
Family Practice                         969268
Nurse Practitioner                      560219
Cardiology                              445088
                                        ...   
Ambulance Service Provider                  42
Unknown Supplier/Provider Specialty         17
All Other Suppliers                         11
Medical Toxicology                           2
Slide Preparation Facility                   2
Name: Provider Type, Length: 88, dtype: int64

In [6]:
diag_radiology_providers = individual_type_providers_reduced_columns.loc[
    individual_type_providers_reduced_columns['Provider Type'] == 'Diagnostic Radiology']
diag_radiology_providers.shape

(1241400, 9)

#### Remove HCPCS codes in the 992xx and 993xx ranges

In [13]:
%%time
diag_radiology_providers = diag_radiology_providers.loc[~diag_radiology_providers['HCPCS Code'].str.contains('992..|993..', regex=True)]
diag_radiology_providers.shape

Wall time: 826 ms


(1238760, 9)
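
To make the behaviour of this filter concrete, here is a small sketch on made-up codes. For five-character codes, the unanchored pattern `992..|993..` can only match from the start of the string, so an anchored prefix test such as `^99[23]` selects the same rows. The anchored pattern is an illustrative alternative, not the code this notebook ran.

```python
import pandas as pd

# Made-up HCPCS codes, for illustration only.
codes = pd.Series(["99213", "99308", "71046", "G0202", "99199", "19921"])

# Pattern used in the notebook: "992" or "993" followed by any two characters.
original = codes.str.contains("992..|993..", regex=True)

# Assumed alternative: explicit prefix match on "992" or "993".
anchored = codes.str.contains(r"^99[23]", regex=True)

# Keep everything that does NOT match the prefix.
kept = codes[~anchored].tolist()
print(kept)
```

Note that "19921" is kept by both patterns: it contains "992", but only one character follows it, so `992..` cannot match.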

#### Next, we need a value count of each HCPCS code for Diagnostic Radiology; first, the number of distinct codes...

In [7]:
diag_radiology_providers['HCPCS Code'].nunique()

1174
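
A minimal sketch of the value count itself, generalized into the top-15 query from step 3. The `top_procedures` helper and the toy rows are hypothetical. Ranking by row count (the file has one row per provider, HCPCS code, and place of service) is only one possible definition of the "normal basket"; summing `Number of Services` per code would be another.

```python
import pandas as pd

def top_procedures(df, provider_type, n=15):
    """Return the n most frequent HCPCS codes for one provider type."""
    subset = df.loc[df["Provider Type"] == provider_type]
    # value_counts sorts by frequency, highest first
    return subset["HCPCS Code"].value_counts().head(n)

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "Provider Type": ["Diagnostic Radiology"] * 4 + ["Cardiology"] * 2,
    "HCPCS Code": ["71046", "71046", "70450", "71046", "93000", "93010"],
})

top = top_procedures(toy, "Diagnostic Radiology", n=2)
print(top)
```

On the real data, the equivalent call would be `top_procedures(individual_type_providers_reduced_columns, 'Diagnostic Radiology')`, assuming the DataFrame built earlier in this notebook.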

In [10]:
diag_radiology_providers.to_csv('diag_radio', index = False)