## Maven Hospital Challenge

#### Challenge Objective

For the Maven Hospital Challenge, you'll play the role of an Analytics Consultant for Massachusetts General Hospital (MGH).

You've been asked to build a high-level KPI report for the executive team, based on a subset of patient records. The purpose of the report is to give stakeholders visibility into the hospital's recent performance, and answer the following questions:

 - How many patients have been admitted or readmitted over time?
 - How long are patients staying in the hospital, on average?
 - How much is the average cost per visit?
 - How many procedures are covered by insurance?

The dashboard should scale to accommodate new data over time, but the CEO has asked you to summarize any insights you can derive from the sample provided.

#### About The Data Set
Synthetic data on ~1k patients of Massachusetts General Hospital from 2011-2022, including information on patient demographics, insurance coverage, and medical encounters & procedures.

In [1]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Let's import all the datasets required for this task
patient_encounters = pd.read_csv("encounters.csv")
# hospital_details = pd.read_csv("organizations.csv")  # not imported: the data covers a single hospital
patient_demographic = pd.read_csv("patients.csv")
insurance_payer = pd.read_csv("payers.csv")
operating_procedures = pd.read_csv("procedures.csv")

In [3]:
# Let's check the number of features and records each dataset has
print("patient encounters data:", patient_encounters.shape)
# print("hospital details:", hospital_details.shape)
print("patient demographic:", patient_demographic.shape)
print("insurance payer:", insurance_payer.shape)
print("operating procedures:", operating_procedures.shape)

patient encounters data: (27891, 14)
patient demographic: (974, 20)
insurance payer: (10, 7)
operating procedures: (47701, 9)


In [4]:
# remove columns that are not necessary for our analysis from patient_encounters
patient_encounters = patient_encounters.drop(['ORGANIZATION'], axis=1)

# remove columns that are not necessary for our analysis from patient_demographic
patient_demographic = patient_demographic.drop(['PREFIX', 'SUFFIX', 'MAIDEN', 'BIRTHPLACE', 'ADDRESS', 'STATE', 'ZIP'], axis=1)

# remove columns that are not necessary for our analysis from insurance_payer
insurance_payer = insurance_payer.drop(['ADDRESS', 'CITY', 'STATE_HEADQUARTERED', 'ZIP', 'PHONE'], axis=1)

In [5]:
# Let's look at a few records from each dataset
print("patient encounters:")
patient_encounters.head(5)

patient encounters:


Unnamed: 0,Id,START,STOP,PATIENT,PAYER,ENCOUNTERCLASS,CODE,DESCRIPTION,BASE_ENCOUNTER_COST,TOTAL_CLAIM_COST,PAYER_COVERAGE,REASONCODE,REASONDESCRIPTION
0,32c84703-2481-49cd-d571-3899d5820253,2011-01-02T09:26:36Z,2011-01-02T12:58:36Z,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,1018.02,0.0,,
1,c98059da-320a-c0a6-fced-c8815f3e3f39,2011-01-03T05:44:39Z,2011-01-03T06:01:42Z,d9ec2e44-32e9-9148-179a-1653348cc4e2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,outpatient,308335008,Patient encounter procedure,142.58,2619.36,0.0,,
2,4ad28a3a-2479-782b-f29c-d5b3f41a001e,2011-01-03T14:32:11Z,2011-01-03T14:47:11Z,73babadf-5b2b-fee7-189e-6f41ff213e01,7caa7254-5050-3b5e-9eae-bd5ea30e809c,outpatient,185349003,Encounter for check up (procedure),85.55,461.59,305.27,,
3,c3f4da61-e4b4-21d5-587a-fbc89943bc19,2011-01-03T16:24:45Z,2011-01-03T16:39:45Z,3b46a0b7-0f34-9b9a-c319-ace4a1f58c0b,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,wellness,162673000,General examination of patient (procedure),136.8,1784.24,0.0,,
4,a9183b4f-2572-72ea-54c2-b3cd038b4be7,2011-01-03T17:36:53Z,2011-01-03T17:51:53Z,fa006887-d93c-d302-8b89-f3c25f88c0e1,42c4fca7-f8a9-3cd1-982a-dd9751bf3e2a,ambulatory,390906007,Follow-up encounter,85.55,234.72,0.0,55822004.0,Hyperlipidemia


In [6]:
print("patient demographic:")
patient_demographic.head(5)

patient demographic:


Unnamed: 0,Id,BIRTHDATE,DEATHDATE,FIRST,LAST,MARITAL,RACE,ETHNICITY,GENDER,CITY,COUNTY,LAT,LON
0,5605b66b-e92d-c16c-1b83-b8bf7040d51f,1977-03-19,,Nikita578,Erdman779,M,white,nonhispanic,F,Quincy,Norfolk County,42.290937,-70.975503
1,6e5ae27c-8038-7988-e2c0-25a103f01bfa,1940-02-19,,Zane918,Hodkiewicz467,M,white,nonhispanic,M,Boston,Suffolk County,42.308831,-71.063162
2,8123d076-0886-9007-e956-d5864aa121a7,1958-06-04,,Quinn173,Marquardt819,M,white,nonhispanic,M,Quincy,Norfolk County,42.265177,-70.967085
3,770518e4-6133-648e-60c9-071eb2f0e2ce,1928-12-25,2017-09-29,Abel832,Smitham825,M,white,hispanic,M,Boston,Suffolk County,42.334304,-71.066801
4,f96addf5-81b9-0aab-7855-d208d3d352c5,1928-12-25,2014-02-23,Edwin773,Labadie908,M,white,hispanic,M,Boston,Suffolk County,42.346771,-71.058813


In [7]:
print("insurance payer:")
insurance_payer.head(10)

insurance payer:


Unnamed: 0,Id,NAME
0,b3221cfc-24fb-339e-823d-bc4136cbc4ed,Dual Eligible
1,7caa7254-5050-3b5e-9eae-bd5ea30e809c,Medicare
2,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,Medicaid
3,d47b3510-2895-3b70-9897-342d681c769d,Humana
4,6e2f1a2d-27bd-3701-8d08-dae202c58632,Blue Cross Blue Shield
5,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,UnitedHealthcare
6,4d71f845-a6a9-3c39-b242-14d25ef86a8d,Aetna
7,047f6ec3-6215-35eb-9608-f9dda363a44c,Cigna Health
8,42c4fca7-f8a9-3cd1-982a-dd9751bf3e2a,Anthem
9,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,NO_INSURANCE


In [8]:
print("operating procedures:") 
operating_procedures.head(5)

operating procedures:


Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,BASE_COST,REASONCODE,REASONDESCRIPTION
0,2011-01-02T09:26:36Z,2011-01-02T12:58:36Z,3de74169-7f67-9304-91d4-757e0f3a14d2,32c84703-2481-49cd-d571-3899d5820253,265764009,Renal dialysis (procedure),903,,
1,2011-01-03T05:44:39Z,2011-01-03T06:01:42Z,d9ec2e44-32e9-9148-179a-1653348cc4e2,c98059da-320a-c0a6-fced-c8815f3e3f39,76601001,Intramuscular injection,2477,,
2,2011-01-04T14:49:55Z,2011-01-04T15:04:55Z,d856d6e6-4c98-e7a2-129b-44076c63d008,2cfd4ddd-ad13-fe1e-528b-15051cea2ec3,703423002,Combined chemotherapy and radiation therapy (p...,11620,363406005.0,Malignant tumor of colon
3,2011-01-05T04:02:09Z,2011-01-05T04:17:09Z,bc9d59c3-0a30-6e3b-f47d-022e4f03c8de,17966936-0878-f4db-128b-a43ae10d0878,173160006,Diagnostic fiberoptic bronchoscopy (procedure),9796,162573006.0,Suspected lung cancer (situation)
4,2011-01-05T12:58:36Z,2011-01-05T16:42:36Z,3de74169-7f67-9304-91d4-757e0f3a14d2,9de5f0b0-4ba4-ce6f-45fb-b55c202f31a5,265764009,Renal dialysis (procedure),1255,,


In [9]:
# Let's check the datatypes present
print("patient encounters:")
patient_encounters.dtypes

patient encounters:


Id                      object
START                   object
STOP                    object
PATIENT                 object
PAYER                   object
ENCOUNTERCLASS          object
CODE                     int64
DESCRIPTION             object
BASE_ENCOUNTER_COST    float64
TOTAL_CLAIM_COST       float64
PAYER_COVERAGE         float64
REASONCODE             float64
REASONDESCRIPTION       object
dtype: object

In [10]:
print("patient demographic:")
patient_demographic.dtypes

patient demographic:


Id            object
BIRTHDATE     object
DEATHDATE     object
FIRST         object
LAST          object
MARITAL       object
RACE          object
ETHNICITY     object
GENDER        object
CITY          object
COUNTY        object
LAT          float64
LON          float64
dtype: object

In [11]:
print("insurance payer:")
insurance_payer.dtypes

insurance payer:


Id      object
NAME    object
dtype: object

In [12]:
print("operating procedures:") 
operating_procedures.dtypes

operating procedures:


START                 object
STOP                  object
PATIENT               object
ENCOUNTER             object
CODE                   int64
DESCRIPTION           object
BASE_COST              int64
REASONCODE           float64
REASONDESCRIPTION     object
dtype: object

In [13]:
# first we will convert some columns to proper datatypes
# patient_encounters
patient_encounters["START"] = pd.to_datetime(patient_encounters["START"]).dt.tz_localize(None)
patient_encounters["STOP"] = pd.to_datetime(patient_encounters["STOP"]).dt.tz_localize(None)
patient_encounters["CODE"] = patient_encounters["CODE"].astype("object")
patient_encounters["REASONCODE"] = patient_encounters["REASONCODE"].astype("object")

# hospital_details
# hospital_details["ZIP"] = hospital_details["ZIP"].astype("object")
# hospital_details["LAT"] = hospital_details["LAT"].astype("object")
# hospital_details["LON"] = hospital_details["LON"].astype("object")

# patient_demographic
patient_demographic["BIRTHDATE"] = pd.to_datetime(patient_demographic["BIRTHDATE"], format='%Y-%m-%d')
patient_demographic["DEATHDATE"] = pd.to_datetime(patient_demographic["DEATHDATE"], format='%Y-%m-%d')
patient_demographic["LAT"] = patient_demographic["LAT"].astype("object")
patient_demographic["LON"] = patient_demographic["LON"].astype("object")

# operating_procedures
operating_procedures["START"] = pd.to_datetime(operating_procedures["START"]).dt.tz_localize(None)
operating_procedures["STOP"] = pd.to_datetime(operating_procedures["STOP"]).dt.tz_localize(None)
operating_procedures["CODE"] = operating_procedures["CODE"].astype("object")
operating_procedures["REASONCODE"] = operating_procedures["REASONCODE"].astype("object")
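
The `START`/`STOP` strings end in `Z`, so `pd.to_datetime` parses them as UTC-aware timestamps, and `.dt.tz_localize(None)` then drops the timezone while keeping the same clock time. A minimal sketch on two illustrative strings (the `sample` series is made up for this example and only mirrors the column format):

```python
import pandas as pd

# Illustrative ISO-8601 strings in the same shape as the START/STOP columns
sample = pd.Series(["2011-01-02T09:26:36Z", "2011-01-02T12:58:36Z"])

aware = pd.to_datetime(sample)       # timezone-aware (UTC) timestamps
naive = aware.dt.tz_localize(None)   # timezone dropped, clock time unchanged

print(aware.dt.tz)                   # timezone of the parsed values
print(naive.dt.tz)                   # None once the timezone is removed
print(naive.iloc[1] - naive.iloc[0]) # elapsed time between the two stamps
```

Working with naive timestamps keeps later subtractions against naive dates (such as `BIRTHDATE`) from raising a mixed-timezone error.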

In [14]:
# Descriptive Statistics
print("patient encounters:")
patient_encounters.describe(include="all").T

patient encounters:


Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
Id,27891.0,27891.0,32c84703-2481-49cd-d571-3899d5820253,1.0,NaT,NaT,,,,,,,
START,27891.0,27541.0,2016-12-08 10:00:40,3.0,2011-01-02 09:26:36,2022-02-05 20:27:36,,,,,,,
STOP,27891.0,27765.0,2016-12-08 10:15:40,3.0,2011-01-02 12:58:36,2022-02-05 20:42:36,,,,,,,
PATIENT,27891.0,974.0,1712d26d-822d-1e3a-2267-0a9dba31d7c8,1381.0,NaT,NaT,,,,,,,
PAYER,27891.0,10.0,7caa7254-5050-3b5e-9eae-bd5ea30e809c,11371.0,NaT,NaT,,,,,,,
ENCOUNTERCLASS,27891.0,6.0,ambulatory,12537.0,NaT,NaT,,,,,,,
CODE,27891.0,45.0,185347001.0,5261.0,NaT,NaT,,,,,,,
DESCRIPTION,27891.0,53.0,Encounter for problem (procedure),4308.0,NaT,NaT,,,,,,,
BASE_ENCOUNTER_COST,27891.0,,,,NaT,NaT,116.181614,28.410082,85.55,85.55,136.8,142.58,146.18
TOTAL_CLAIM_COST,27891.0,,,,NaT,NaT,3639.682174,9205.595748,0.0,142.58,278.58,1412.53,641882.7


In [15]:
print("patient demographic:") 
patient_demographic.describe(include="all").T

patient demographic:


Unnamed: 0,count,unique,top,freq,first,last
Id,974.0,974.0,5605b66b-e92d-c16c-1b83-b8bf7040d51f,1.0,NaT,NaT
BIRTHDATE,974.0,880.0,1925-11-17 00:00:00,4.0,1922-03-24,1991-11-27
DEATHDATE,154.0,148.0,2017-09-29 00:00:00,2.0,2011-02-03,2022-01-27
FIRST,974.0,842.0,Domenic627,3.0,NaT,NaT
LAST,974.0,498.0,Heaney114,6.0,NaT,NaT
MARITAL,973.0,2.0,M,784.0,NaT,NaT
RACE,974.0,6.0,white,680.0,NaT,NaT
ETHNICITY,974.0,2.0,nonhispanic,783.0,NaT,NaT
GENDER,974.0,2.0,M,494.0,NaT,NaT
CITY,974.0,29.0,Boston,541.0,NaT,NaT


In [16]:
print("insurance payer:")
insurance_payer.describe(include="all").T

insurance payer:


Unnamed: 0,count,unique,top,freq
Id,10,10,b3221cfc-24fb-339e-823d-bc4136cbc4ed,1
NAME,10,10,Dual Eligible,1


In [17]:
print("operating procedures:") 
operating_procedures.describe(include="all").T

operating procedures:


Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
START,47701.0,39251.0,2013-09-30 22:31:23,21.0,2011-01-02 09:26:36,2022-01-29 20:35:37,,,,,,,
STOP,47701.0,42263.0,2019-03-12 08:27:16,20.0,2011-01-02 12:58:36,2022-01-29 21:08:12,,,,,,,
PATIENT,47701.0,793.0,1712d26d-822d-1e3a-2267-0a9dba31d7c8,1783.0,NaT,NaT,,,,,,,
ENCOUNTER,47701.0,14670.0,66b2ab44-a2cc-8053-8f4e-c5be57e50cc4,186.0,NaT,NaT,,,,,,,
CODE,47701.0,157.0,710824005.0,4596.0,NaT,NaT,,,,,,,
DESCRIPTION,47701.0,163.0,Assessment of health and social care needs (pr...,4596.0,NaT,NaT,,,,,,,
BASE_COST,47701.0,,,,NaT,NaT,2212.064967,5572.978748,1.0,431.0,431.0,966.0,289531.0
REASONCODE,10756.0,46.0,72892002.0,5718.0,NaT,NaT,,,,,,,
REASONDESCRIPTION,10756.0,46.0,Normal pregnancy,5718.0,NaT,NaT,,,,,,,


In [18]:
# Let's look at the concise information on each dataset
print("patient encounters:")
patient_encounters.info()

patient encounters:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27891 entries, 0 to 27890
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Id                   27891 non-null  object        
 1   START                27891 non-null  datetime64[ns]
 2   STOP                 27891 non-null  datetime64[ns]
 3   PATIENT              27891 non-null  object        
 4   PAYER                27891 non-null  object        
 5   ENCOUNTERCLASS       27891 non-null  object        
 6   CODE                 27891 non-null  object        
 7   DESCRIPTION          27891 non-null  object        
 8   BASE_ENCOUNTER_COST  27891 non-null  float64       
 9   TOTAL_CLAIM_COST     27891 non-null  float64       
 10  PAYER_COVERAGE       27891 non-null  float64       
 11  REASONCODE           8350 non-null   object        
 12  REASONDESCRIPTION    8350 non-null   object        
dtypes: datetime

In [19]:
print("patient demographic:") 
patient_demographic.info()

patient demographic:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 974 entries, 0 to 973
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Id         974 non-null    object        
 1   BIRTHDATE  974 non-null    datetime64[ns]
 2   DEATHDATE  154 non-null    datetime64[ns]
 3   FIRST      974 non-null    object        
 4   LAST       974 non-null    object        
 5   MARITAL    973 non-null    object        
 6   RACE       974 non-null    object        
 7   ETHNICITY  974 non-null    object        
 8   GENDER     974 non-null    object        
 9   CITY       974 non-null    object        
 10  COUNTY     974 non-null    object        
 11  LAT        974 non-null    object        
 12  LON        974 non-null    object        
dtypes: datetime64[ns](2), object(11)
memory usage: 99.0+ KB


In [20]:
print("insurance payer:")
insurance_payer.info()

insurance payer:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Id      10 non-null     object
 1   NAME    10 non-null     object
dtypes: object(2)
memory usage: 288.0+ bytes


In [21]:
print("operating procedures:") 
operating_procedures.info()

operating procedures:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47701 entries, 0 to 47700
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   START              47701 non-null  datetime64[ns]
 1   STOP               47701 non-null  datetime64[ns]
 2   PATIENT            47701 non-null  object        
 3   ENCOUNTER          47701 non-null  object        
 4   CODE               47701 non-null  object        
 5   DESCRIPTION        47701 non-null  object        
 6   BASE_COST          47701 non-null  int64         
 7   REASONCODE         10756 non-null  object        
 8   REASONDESCRIPTION  10756 non-null  object        
dtypes: datetime64[ns](2), int64(1), object(6)
memory usage: 3.3+ MB


In [22]:
# Merge encounters with patients
encounters_patients = pd.merge(patient_encounters, patient_demographic, left_on='PATIENT', right_on='Id', how='inner', suffixes=('_enc', '_pat'))
# Merge the result with payers
encounters_patients_payers = pd.merge(encounters_patients, insurance_payer, left_on='PAYER', right_on='Id', how='inner', suffixes=('', '_pay'))

# Merge the result with procedures
final_dataset = pd.merge(encounters_patients_payers, operating_procedures, left_on='Id_enc', right_on='ENCOUNTER', how='outer', suffixes=('', '_proc'))

# Remove duplicate columns if necessary
final_dataset = final_dataset.loc[:,~final_dataset.columns.duplicated()]

# Display the final dataset shape
final_dataset.shape

(60922, 37)
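
The merged frame has 60,922 rows for 27,891 encounters because each encounter row is repeated once per procedure. Encounter-level columns such as `TOTAL_CLAIM_COST` are therefore over-weighted for procedure-heavy encounters, which is consistent with the mean claim cost moving from about 3,640 in the encounters table to about 6,569 in the merged data described further down. A minimal sketch of the effect on toy data (all values are hypothetical, and a left join is used here instead of the outer join above):

```python
import pandas as pd

# Toy data (hypothetical values): encounter e1 has two procedures, e2 has none
encounters = pd.DataFrame({
    "Id": ["e1", "e2"],
    "TOTAL_CLAIM_COST": [1000.0, 200.0],
})
procedures = pd.DataFrame({
    "ENCOUNTER": ["e1", "e1"],
    "BASE_COST": [400, 500],
})

# validate="one_to_many" raises MergeError if encounter Ids are not unique
merged = pd.merge(encounters, procedures, left_on="Id", right_on="ENCOUNTER",
                  how="left", validate="one_to_many")

# e1 now appears twice, so a mean over all rows over-weights it
naive_mean = merged["TOTAL_CLAIM_COST"].mean()

# Keep one row per encounter before averaging encounter-level columns
per_encounter_mean = merged.drop_duplicates("Id")["TOTAL_CLAIM_COST"].mean()

print(len(merged), naive_mean, per_encounter_mean)
```

For KPIs such as average cost per visit, deduplicating on the encounter identifier first is likely the safer choice.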

In [23]:
final_dataset.columns

Index(['Id_enc', 'START', 'STOP', 'PATIENT', 'PAYER', 'ENCOUNTERCLASS', 'CODE',
       'DESCRIPTION', 'BASE_ENCOUNTER_COST', 'TOTAL_CLAIM_COST',
       'PAYER_COVERAGE', 'REASONCODE', 'REASONDESCRIPTION', 'Id_pat',
       'BIRTHDATE', 'DEATHDATE', 'FIRST', 'LAST', 'MARITAL', 'RACE',
       'ETHNICITY', 'GENDER', 'CITY', 'COUNTY', 'LAT', 'LON', 'Id', 'NAME',
       'START_proc', 'STOP_proc', 'PATIENT_proc', 'ENCOUNTER', 'CODE_proc',
       'DESCRIPTION_proc', 'BASE_COST', 'REASONCODE_proc',
       'REASONDESCRIPTION_proc'],
      dtype='object')

In [24]:
# Remove duplicate columns by selecting only one version of them
final_dataset = final_dataset.drop(columns=['PATIENT_proc', 'Id_pat', 'Id'])

# Rename columns for clarity and consistency
final_dataset.rename(columns={
    'Id_enc': 'Encounter_ID',
    'START': 'Encounter_Start',
    'STOP': 'Encounter_Stop',
    'PATIENT': 'Patient_ID',
    'PAYER': 'Payer_ID',
    'ENCOUNTERCLASS': 'Encounter_Class',
    'CODE': 'Encounter_Code',
    'DESCRIPTION': 'Encounter_Description',
    'BASE_ENCOUNTER_COST': 'Base_Encounter_Cost',
    'TOTAL_CLAIM_COST': 'Total_Claim_Cost',
    'PAYER_COVERAGE': 'Payer_Coverage',
    'REASONCODE': 'Encounter_ReasonCode',
    'REASONDESCRIPTION': 'Encounter_ReasonDescription',
    'NAME': 'Payer_Name',
    'START_proc': 'Procedure_Start', 
    'STOP_proc': 'Procedure_Stop',
    'CODE_proc': 'Procedure_Code',
    'DESCRIPTION_proc': 'Procedure_Description',
    'BASE_COST': 'Procedure_Base_Cost',
    'REASONCODE_proc': 'Procedure_ReasonCode',
    'REASONDESCRIPTION_proc': 'Procedure_ReasonDescription'
}, inplace=True)

In [25]:
# To calculate patient ages, use the last day of 2021 as the reference date, since there are very few records from 2022
from datetime import datetime
reference_date = datetime(2021, 12, 31)

# Calculate age as of the end of 2021 (used below for patients without a DEATHDATE)
final_dataset['Age_at_2021_End'] = (reference_date - final_dataset['BIRTHDATE']).dt.days // 365

# Calculate age at the time of death for patients who have a DEATHDATE
final_dataset['Age_at_Death'] = (final_dataset['DEATHDATE'] - final_dataset['BIRTHDATE']).dt.days // 365

# Replace NaN in 'Age_at_Death' with the 'Age_at_2021_End' where applicable
final_dataset['Age'] = final_dataset['Age_at_Death'].fillna(final_dataset['Age_at_2021_End'])

# Drop the temporary columns used for calculation
final_dataset.drop(columns=['Age_at_2021_End', 'Age_at_Death'], inplace=True)
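
Dividing the day count by 365 ignores leap days, so the result can be off by one year close to a birthday. If exact completed years matter, a calendar-based calculation is one alternative. The sketch below is an assumption-laden illustration: `age_in_years` is a hypothetical helper, and the two birth/death dates are taken from the demographic sample shown earlier.

```python
import pandas as pd

def age_in_years(birth: pd.Series, ref: pd.Series) -> pd.Series:
    """Completed years between birth and ref (hypothetical helper)."""
    years = ref.dt.year - birth.dt.year
    # Subtract one year when the birthday has not yet occurred in the ref year
    before_birthday = (ref.dt.month < birth.dt.month) | (
        (ref.dt.month == birth.dt.month) & (ref.dt.day < birth.dt.day)
    )
    return years - before_birthday.astype(int)

birth = pd.to_datetime(pd.Series(["1940-02-19", "1928-12-25"]))
death = pd.to_datetime(pd.Series([None, "2017-09-29"]))

# Use the death date where present, otherwise the end-of-2021 reference date
ref = death.fillna(pd.Timestamp(2021, 12, 31))
ages = age_in_years(birth, ref)
print(ages.tolist())
```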

In [26]:
# Add columns with the encounter duration and the procedure duration, both in hours
final_dataset['Encounter_Duration'] = (((final_dataset['Encounter_Stop'] - final_dataset['Encounter_Start']).dt.total_seconds())/3600).round(2)
final_dataset['Procedure_Duration'] = (((final_dataset['Procedure_Stop'] - final_dataset['Procedure_Start']).dt.total_seconds())/3600).round(2)
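
These durations feed the "how long are patients staying" question. A minimal sketch of an average stay per encounter class, on toy data with one row per encounter (the values are hypothetical; on the merged data, repeated encounter rows would need to be dropped first):

```python
import pandas as pd

# Toy encounters (hypothetical values), one row per encounter
toy = pd.DataFrame({
    "Encounter_Class": ["inpatient", "inpatient", "outpatient"],
    "Encounter_Start": pd.to_datetime(
        ["2021-01-01 08:00", "2021-02-01 08:00", "2021-03-01 09:00"]),
    "Encounter_Stop": pd.to_datetime(
        ["2021-01-02 08:00", "2021-02-03 08:00", "2021-03-01 09:15"]),
})

# Dividing a timedelta by a one-hour Timedelta gives the duration in hours
toy["Encounter_Duration"] = (
    (toy["Encounter_Stop"] - toy["Encounter_Start"]) / pd.Timedelta(hours=1)
).round(2)

# Mean duration in hours for each encounter class
avg_stay = toy.groupby("Encounter_Class")["Encounter_Duration"].mean()
print(avg_stay)
```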

In [27]:
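# Fill missing values: MARITAL with its most frequent value ('M'), other text columns with 'Other', procedure cost and duration with 0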
final_dataset.fillna({'MARITAL': 'M', 
                      'Encounter_ReasonCode': 'Other', 'Encounter_ReasonDescription': 'Other', 'ENCOUNTER': 'Other', 
                      'Procedure_Code': 'Other', 'Procedure_Description': 'Other',
                      'Procedure_ReasonCode': 'Other', 'Procedure_ReasonDescription': 'Other',
                      'Procedure_Base_Cost':0, 'Procedure_Duration': 0}, inplace=True)

In [28]:
final_dataset.duplicated().sum()

0

In [29]:
final_dataset.drop_duplicates(inplace=True)

In [30]:
# Check for missing values; only DEATHDATE, Procedure_Start and Procedure_Stop are expected to contain nulls
final_dataset.isnull().sum()

Encounter_ID                       0
Encounter_Start                    0
Encounter_Stop                     0
Patient_ID                         0
Payer_ID                           0
Encounter_Class                    0
Encounter_Code                     0
Encounter_Description              0
Base_Encounter_Cost                0
Total_Claim_Cost                   0
Payer_Coverage                     0
Encounter_ReasonCode               0
Encounter_ReasonDescription        0
BIRTHDATE                          0
DEATHDATE                      49200
FIRST                              0
LAST                               0
MARITAL                            0
RACE                               0
ETHNICITY                          0
GENDER                             0
CITY                               0
COUNTY                             0
LAT                                0
LON                                0
Payer_Name                         0
Procedure_Start                13221
P

In [31]:
# save a csv file for further use
final_dataset.to_csv("mgh_data.csv", index=False)

In [32]:
# import the final dataset using pandas
final_dataset = pd.read_csv("mgh_data.csv")
final_dataset.shape

(60922, 37)

In [33]:
# correct datatypes
final_dataset['Encounter_Start'] = pd.to_datetime(final_dataset['Encounter_Start'])
final_dataset['Encounter_Stop'] = pd.to_datetime(final_dataset['Encounter_Stop'])
final_dataset['Procedure_Start'] = pd.to_datetime(final_dataset['Procedure_Start'])
final_dataset['Procedure_Stop'] = pd.to_datetime(final_dataset['Procedure_Stop'])
final_dataset['Encounter_Code'] = final_dataset['Encounter_Code'].astype('object')
final_dataset['Procedure_Code'] = final_dataset['Procedure_Code'].astype('object')
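
The same conversions can also be requested while reading the file, which keeps the dtype fixes next to the import. A minimal sketch, using an in-memory stand-in for `mgh_data.csv` (the two-row sample and its column subset are hypothetical):

```python
import io
import pandas as pd

# In-memory stand-in for mgh_data.csv (hypothetical two-row sample)
csv_text = (
    "Encounter_ID,Encounter_Start,Procedure_Start,Encounter_Code\n"
    "e1,2011-01-02 09:26:36,2011-01-02 09:26:36,185347001\n"
    "e2,2011-01-03 05:44:39,,308335008\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Encounter_Start", "Procedure_Start"],  # parse while reading
    dtype={"Encounter_Code": "object"},                   # keep codes as text
)
print(df.dtypes)
```

Missing procedure timestamps come back as `NaT`, so encounters without a procedure remain easy to identify.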

In [34]:
final_dataset.dtypes

Encounter_ID                           object
Encounter_Start                datetime64[ns]
Encounter_Stop                 datetime64[ns]
Patient_ID                             object
Payer_ID                               object
Encounter_Class                        object
Encounter_Code                         object
Encounter_Description                  object
Base_Encounter_Cost                   float64
Total_Claim_Cost                      float64
Payer_Coverage                        float64
Encounter_ReasonCode                   object
Encounter_ReasonDescription            object
BIRTHDATE                              object
DEATHDATE                              object
FIRST                                  object
LAST                                   object
MARITAL                                object
RACE                                   object
ETHNICITY                              object
GENDER                                 object
CITY                              

In [35]:
final_dataset.describe(include='all').T

Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
Encounter_ID,60922.0,27891.0,66b2ab44-a2cc-8053-8f4e-c5be57e50cc4,186.0,NaT,NaT,,,,,,,
Encounter_Start,60922.0,27541.0,2017-07-14 14:22:57,186.0,2011-01-02 09:26:36,2022-02-05 20:27:36,,,,,,,
Encounter_Stop,60922.0,27765.0,2017-07-14 14:37:57,186.0,2011-01-02 12:58:36,2022-02-05 20:42:36,,,,,,,
Patient_ID,60922.0,974.0,1712d26d-822d-1e3a-2267-0a9dba31d7c8,1931.0,NaT,NaT,,,,,,,
Payer_ID,60922.0,10.0,7caa7254-5050-3b5e-9eae-bd5ea30e809c,25961.0,NaT,NaT,,,,,,,
Encounter_Class,60922.0,6.0,ambulatory,23508.0,NaT,NaT,,,,,,,
Encounter_Code,60922.0,45.0,185349003.0,15050.0,NaT,NaT,,,,,,,
Encounter_Description,60922.0,53.0,Encounter for check up (procedure),14010.0,NaT,NaT,,,,,,,
Base_Encounter_Cost,60922.0,,,,NaT,NaT,112.420501,27.848559,85.55,85.55,87.71,142.58,146.18
Total_Claim_Cost,60922.0,,,,NaT,NaT,6569.141692,16708.569918,0.0,232.47,704.2,5943.6,641882.7


In [36]:
final_dataset.head()

Unnamed: 0,Encounter_ID,Encounter_Start,Encounter_Stop,Patient_ID,Payer_ID,Encounter_Class,Encounter_Code,Encounter_Description,Base_Encounter_Cost,Total_Claim_Cost,...,Procedure_Stop,ENCOUNTER,Procedure_Code,Procedure_Description,Procedure_Base_Cost,Procedure_ReasonCode,Procedure_ReasonDescription,Age,Encounter_Duration,Procedure_Duration
0,32c84703-2481-49cd-d571-3899d5820253,2011-01-02 09:26:36,2011-01-02 12:58:36,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,1018.02,...,2011-01-02 12:58:36,32c84703-2481-49cd-d571-3899d5820253,265764009,Renal dialysis (procedure),903.0,Other,Other,88.0,3.53,3.53
1,9de5f0b0-4ba4-ce6f-45fb-b55c202f31a5,2011-01-05 12:58:36,2011-01-05 16:42:36,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,1370.79,...,2011-01-05 16:42:36,9de5f0b0-4ba4-ce6f-45fb-b55c202f31a5,265764009,Renal dialysis (procedure),1255.0,Other,Other,88.0,3.73,3.73
2,03f54837-bfc8-81aa-4905-f74ceb35162f,2011-01-08 16:42:36,2011-01-08 20:15:36,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,1671.32,...,2011-01-08 20:15:36,03f54837-bfc8-81aa-4905-f74ceb35162f,265764009,Renal dialysis (procedure),1556.0,Other,Other,88.0,3.55,3.55
3,60fd4512-b01b-8b81-a17c-355da9e2f8f6,2011-01-11 20:15:36,2011-01-11 22:49:36,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,943.24,...,2011-01-11 22:49:36,60fd4512-b01b-8b81-a17c-355da9e2f8f6,265764009,Renal dialysis (procedure),828.0,Other,Other,88.0,2.57,2.57
4,11be4683-6519-e8ca-94fa-a75c5f7a0474,2011-01-14 22:49:36,2011-01-15 01:58:36,3de74169-7f67-9304-91d4-757e0f3a14d2,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,ambulatory,185347001,Encounter for problem (procedure),85.55,1744.71,...,2011-01-15 01:58:36,11be4683-6519-e8ca-94fa-a75c5f7a0474,265764009,Renal dialysis (procedure),1630.0,Other,Other,88.0,3.15,3.15


#### Univariate Analysis

In [40]:
# No. of Unique Encounters
print("Unique Encounters:", final_dataset['Encounter_ID'].nunique())
# No. of Unique Patients
print("Unique Patients:", final_dataset['Patient_ID'].nunique())
# No. of Unique Insurance Payers (excluding the NO_INSURANCE placeholder)
print("Unique Insurance Payers:", final_dataset.loc[final_dataset['Payer_Name'] != 'NO_INSURANCE', 'Payer_ID'].nunique())
# No. of Unique Encounter Classes
print("Unique Encounter Classes:", final_dataset["Encounter_Class"].nunique())
# Total Procedures (rows with a real procedure code, not the 'Other' filler)
has_procedure = final_dataset['Procedure_Code'] != 'Other'
print("Total Procedures:", has_procedure.sum())
# No. of Unique Procedures
print("No. of Unique Procedures:", final_dataset.loc[has_procedure, 'Procedure_Code'].nunique())


Unique Encounters: 27891
Unique Patients: 974
Unique Insurance Payers: 9
Unique Encounter Classes: 6
Total Procedures: 47701
No. of Unique Procedures: 157
