<h1>Mid-Term Project</h1>
<h3>Instructions</h3>
<ul>
<li>Complete the data ingestion, exploration, and wrangling process by:
    <ul>
        <li>Identifying 5 features in the data that are problematic and the solutions to resolve them</li>
        <li>Providing 5 steps for standardizing the dataset</li>
        <li>Creating a short PowerPoint slide deck to showcase your work</li>
        <li>The PowerPoint should be 5 to 15 slides long. A PowerPoint template will be provided:
            <ul>
                <li>Provide a description and context of the dataset</li>
                <li>Highlight the work done on the dataset</li>
                <li>Provide an assessment of the data quality before and after</li>
                <li>Provide your recommendation on the data: whether others should or should not use it</li>
            </ul>
          </li>
    </ul>
        </li>
    <li>Submission:
            <ul>
                <li>A working Jupyter notebook that showcases all the steps applied to the data, and a PowerPoint slide deck</li>
                <li>The project will be graded by your team members. Each team member will grade 2 projects that are not their own. The midterm project grade will be the average of the 2 scores. A rubric will be provided to help with the scoring.</li>
            </ul>
    </li>
</ul>

In [None]:
%%time 
import os  
import pandas as pd  
import numpy as np 
import pyodbc
import matplotlib.pyplot as plt
import pickle
from unicodedata import normalize
from IPython.display import HTML
from datetime import datetime, date


class loc:
    d0 = os.getcwd() + '\\'  
    d1 = d0 + 'Exercises\\Mid_Term_Project\\'
    pdr = '\\\\clinisilonhh\\ifs\\PHI_Access\\PHI-CO - DSci Nhan Tran\\ENT9521 Python Curriculum\\data\\'
    
print(loc.d1)

In [None]:
%%time 

#read the mid-term dataset
with open(loc.pdr + 'midterm_dataset.pkl', 'rb') as f:
    MyMidData = pickle.load(f)
    
    display(MyMidData)

In [None]:
%%time 

MyMidData.info()
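
The assignment asks for a data quality assessment before and after the cleaning. Beyond `info()`, a per-column summary can make that comparison easier. The following is a minimal sketch under stated assumptions: `quality_summary` is a hypothetical helper, and the toy DataFrame only stands in for the mid-term dataset.

```python
import pandas as pd

def quality_summary(df):
    """Return one row per column: dtype, missing count, missing percent, distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'missing_pct': (df.isna().mean() * 100).round(2),
        'n_unique': df.nunique(dropna=True),
    })

# Hypothetical toy data standing in for the mid-term dataset
toy = pd.DataFrame({
    'DATABASEID': ['1', '2', '3', '4'],
    'ZIPCODE': ['80045-1234', '80012', None, '80045'],
    'SEX': ['F', 'M', 'F', None],
})

summary = quality_summary(toy)
print(summary)
```

Running the same helper on the raw and on the standardized DataFrame would give two tables that can be compared side by side.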

<h1>Problems in this Dataset</h1>
<ol>
<li>Column data types should match the values they hold</li>
<li>ZIP codes should contain only 5 digits, so the column needs cleaning</li>
<li>The Ethnicity column can be made more precise and clean</li>
<li>Patient age is not in the dataset and would be helpful to have</li>
<li>Vizient_Sub_Service_Line can be split into Vizient_Service_Line and Sub_Service_Line</li>
<li>Vizient MSDRG would be more useful if split into a code and a description</li>
<li>The categorical variables such as Ethnicity, Sex, Race, and SECTION need to be converted into dummy variables to make the dataset ready for advanced statistical analysis</li>
</ol>

<h1>Solutions to the Identified Problems: Dataset Standardization</h1>

In [None]:
%%time 

#Convert DATABASEID to int
#First, let me create a copy of the original dataset
#This keeps the original dataset unaffected by the changes I make to standardize the data
MyStandardData = MyMidData.copy()

#Assign to the column directly: in recent pandas versions, writing through
#.loc[:, ['DATABASEID']] sets values in place and can leave the old dtype unchanged
MyStandardData['DATABASEID'] = MyStandardData['DATABASEID'].astype('int64')

#for example, see the comparison below
print('Before')
MyMidData.loc[:, ['DATABASEID']].info()
print('\n')

print('After')
MyStandardData.loc[:, ['DATABASEID']].info()
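
`astype('int64')` raises an error if any value cannot be parsed or is missing. Where that is a concern, a more tolerant conversion could look like the sketch below. The ID values are hypothetical, and whether the real DATABASEID column contains such entries is not known from the notebook.

```python
import pandas as pd

# Hypothetical ID values; 'abc' cannot be parsed and one entry is missing
ids = pd.Series(['101', '102', 'abc', None], name='DATABASEID')

# errors='coerce' turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(ids, errors='coerce')

# The nullable 'Int64' dtype keeps integers while allowing missing values
ids_int = numeric.astype('Int64')

# Entries that were present but could not be converted
n_bad = int(numeric.isna().sum() - ids.isna().sum())
print(ids_int)
print('Entries that could not be converted:', n_bad)
```

Counting the coerced entries makes the data loss explicit instead of hiding it.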

In [None]:
%%time 

#Convert ADMISSIONDATE and DISCHARGEDATE to datetime
#pd.to_datetime() on a whole DataFrame expects year/month/day columns and raises an error here,
#so I convert each column on its own and assign it back directly to change the dtype
for col in ['ADMISSIONDATE', 'DISCHARGEDATE']:
    MyStandardData[col] = pd.to_datetime(MyStandardData[col])

MyStandardData.loc[:, ['ADMISSIONDATE', 'DISCHARGEDATE']].info()
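
If some date strings are malformed, the conversion can be made tolerant and the failures counted. This is a sketch only: the `'%Y-%m-%d'` format and the toy values are assumptions, since the notebook does not show the actual date format in the dataset.

```python
import pandas as pd

# Hypothetical values; '2020-02-30' and 'not a date' are deliberately invalid
toy = pd.DataFrame({
    'ADMISSIONDATE': ['2020-01-15', '2020-02-30', '2020-03-01'],
    'DISCHARGEDATE': ['2020-01-20', '2020-03-02', 'not a date'],
})

date_cols = ['ADMISSIONDATE', 'DISCHARGEDATE']
for col in date_cols:
    # An explicit format avoids per-value guessing; unparseable values become NaT
    toy[col] = pd.to_datetime(toy[col], format='%Y-%m-%d', errors='coerce')

print(toy.dtypes)
print(toy.isna().sum())
```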

In [None]:
%%time 

#Clean Zip Code to have only the first five digits but keep the data type as object
MyStandardData.loc[:, ('ZIPCODE')] = (MyStandardData.loc[:, ('ZIPCODE')]).str[:5]

print("Zip code before cleaning")
display(MyMidData.loc[:, ('ZIPCODE')].head())
print('\n')

print("Zip code after cleaning")
display(MyStandardData.loc[:, ('ZIPCODE')].head())
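
Slicing with `.str[:5]` keeps whatever the first five characters are, including values that are too short or not numeric. A stricter variant, sketched here on hypothetical values, keeps a ZIP code only when it starts with five digits.

```python
import pandas as pd

# Hypothetical ZIP values: ZIP+4, plain 5-digit, too short, missing, and 9 digits without a hyphen
zips = pd.Series(['80045-1234', '80012', '8004', None, '800451234'])

# Keep the leading five digits; values that do not start with five digits become NaN
zip5 = zips.str.extract(r'^(\d{5})', expand=False)
print(zip5.tolist())
```

The NaN results mark entries that would need a manual look rather than being passed along silently.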

In [None]:
# Checking the progress
print('Before Standardization:')
display(MyMidData.iloc[:,0:6].head())
print('\n')

print('After Standardization')
display(MyStandardData.iloc[:,0:6].head())
print('\n')

print('Datatypes Before Standardization:')
MyMidData.iloc[:,0:6].info()
print('\n')

print('Datatypes After Standardization:')
MyStandardData.iloc[:,0:6].info()

In [None]:
%%time 

#Let's see the unique values in ethnicity
print('Ethnicity unique values before cleaning')
display(MyStandardData.loc[:, ('ETHNICITY')].unique())
print('\n')

#I want to remove the ' Origin' part because I believe it is just extra information
print('Ethnicity unique values after cleaning')
MyStandardData.loc[:, ('ETHNICITY')] = MyStandardData.loc[:, ('ETHNICITY')].str.replace(' Origin', '', regex=False)

display(MyStandardData.loc[:, ('ETHNICITY')].unique())
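
Substring replacement works, but an explicit mapping documents every recoding decision and exposes values that were not anticipated. The category labels below are hypothetical examples, not the actual values in the ETHNICITY column.

```python
import pandas as pd

# Hypothetical raw labels; the real unique values come from the cell above
eth = pd.Series(['Hispanic Origin', 'Non Hispanic Origin', 'Hispanic Unknown', None])

# One entry per raw label, so every recoding decision is visible in one place
mapping = {
    'Hispanic Origin': 'Hispanic',
    'Non Hispanic Origin': 'Non_Hispanic',
    'Hispanic Unknown': 'Unknown',
}

clean = eth.map(mapping)

# Labels that were present but have no entry in the mapping
unmapped = eth[clean.isna() & eth.notna()]
print(clean.tolist())
print('Unmapped labels:', unmapped.unique().tolist())
```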

In [None]:
%%time 

#Let's see the unique values in the Sex column
print('Sex column unique values')
display(MyStandardData.loc[:, ('SEX')].unique())
print('\n')

#Let's see the unique values in the Race column
print('Race column unique values')
display(MyStandardData.loc[:, ('RACE')].unique())
print('\n')

#Let's see the unique values in the DEATHFLAG column
print('DEATHFLAG column unique values')
display(MyStandardData.loc[:, ('DEATHFLAG')].unique())
print('\n')

In [None]:
%%time 

#Converting the BIRTHDATE column into datetime datatype, converting the DEATHFLAG column into int
#I assign to each column directly: in recent pandas versions, writing through .loc[:, [col]]
#sets values in place and can leave the old dtype unchanged
MyStandardData['BIRTHDATE'] = pd.to_datetime(MyStandardData['BIRTHDATE'])

MyStandardData['DEATHFLAG'] = MyStandardData['DEATHFLAG'].astype('int64')

MyStandardData.loc[:, ['BIRTHDATE', 'DEATHFLAG']].info()

In [None]:
%%time 

#To calculate each patient's age at the time of admission to the hospital
#I need to have the year, month and day extracted from both columns first
#And, I am using the standardized columns

MyStandardData['ADMISSIONYEAR'] = MyStandardData['ADMISSIONDATE'].dt.year
MyStandardData['ADMISSIONMONTH'] = MyStandardData['ADMISSIONDATE'].dt.month
MyStandardData['ADMISSIONDAY'] = MyStandardData['ADMISSIONDATE'].dt.day

print('ADMISSION DATE, YEAR, MONTH, AND DAY:')
display(MyStandardData[['ADMISSIONDATE', 'ADMISSIONYEAR', 'ADMISSIONMONTH', 'ADMISSIONDAY']].head())
print('\n')

MyStandardData['BIRTHYEAR'] = MyStandardData['BIRTHDATE'].dt.year
MyStandardData['BIRTHMONTH'] = MyStandardData['BIRTHDATE'].dt.month
MyStandardData['BIRTHDAY'] = MyStandardData['BIRTHDATE'].dt.day

print('BIRTHDATE, YEAR, MONTH, AND DAY:')
display(MyStandardData[['BIRTHDATE', 'BIRTHYEAR', 'BIRTHMONTH', 'BIRTHDAY']].head())

In [None]:
#Calculating Age
#I subtract the birth year from the admission year, and then subtract one more year
#if the birthday has not yet occurred in the admission year

#1 if the admission month is earlier than the birth month, otherwise 0
MyStandardData['MonthNotPass'] = np.where(MyStandardData['ADMISSIONMONTH'] < MyStandardData['BIRTHMONTH'], 1, 0)

#Same month, but the admission day is earlier than the birth day
SameMonthDayNotPass = ((MyStandardData['ADMISSIONMONTH'] == MyStandardData['BIRTHMONTH'])
                       & (MyStandardData['ADMISSIONDAY'] < MyStandardData['BIRTHDAY']))

#1 if the birthday has not yet passed in the admission year, otherwise 0
MyStandardData['MonthAndDayNotPass'] = np.where((MyStandardData['MonthNotPass'] == 1) | SameMonthDayNotPass, 1, 0)

MyStandardData['AGE'] = MyStandardData['ADMISSIONYEAR'] - MyStandardData['BIRTHYEAR'] - MyStandardData['MonthAndDayNotPass']

display(MyStandardData[['ADMISSIONDATE', 'BIRTHDATE', 'AGE']].head())
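
The age rule can be checked on a few dates around a birthday. The sketch below uses a hypothetical helper, `age_at`, and made-up dates; it computes completed years without creating intermediate columns.

```python
import pandas as pd

def age_at(admission, birth):
    """Completed years between two datetime Series, element-wise."""
    years = admission.dt.year - birth.dt.year
    # The birthday has not occurred yet if (month, day) of admission precedes (month, day) of birth
    before_birthday = (
        (admission.dt.month < birth.dt.month)
        | ((admission.dt.month == birth.dt.month) & (admission.dt.day < birth.dt.day))
    )
    return years - before_birthday.astype(int)

# Hypothetical dates: one day before the birthday, on the birthday, and an earlier month
birth = pd.to_datetime(pd.Series(['1990-06-15', '1990-06-15', '1990-06-15']))
admission = pd.to_datetime(pd.Series(['2020-06-14', '2020-06-15', '2020-05-30']))

ages = age_at(admission, birth)
print(ages.tolist())
```

The first and third admissions fall before the 30th birthday, so only the second should count 30 completed years.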

In [None]:
#Splitting the VIZIENT_SUB_SERVICELINE column into VIZ_SERVICELINE and VIZ_SUB_SERVICELINE
VizServiceLine = MyStandardData.loc[:, ('VIZIENT_SUB_SERVICELINE')].str.split('-', expand=True)
MyStandardData['VIZ_SERVICELINE'] = VizServiceLine[0]
MyStandardData['VIZ_SUB_SERVICELINE'] = VizServiceLine[1]

display(MyStandardData.head())
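
Splitting on every hyphen can cut off part of a value when the second half itself contains a hyphen, and it may leave surrounding spaces. A possible refinement, sketched on hypothetical service line names, splits on the first delimiter only and trims whitespace. The same idea would apply to a code and description separated by a colon.

```python
import pandas as pd

# Hypothetical values; the second has a hyphen inside the sub service line, the third has no delimiter
s = pd.Series(['Cardiology - Heart Failure', 'Surgery - Non-Invasive', 'Oncology'])

# n=1 splits on the first hyphen only, so later hyphens stay in the second part
parts = s.str.split('-', n=1, expand=True)
service_line = parts[0].str.strip()
sub_service_line = parts[1].str.strip()

print(pd.DataFrame({'service': service_line, 'sub': sub_service_line}))
```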

In [None]:
#Let me concatenate what I standardized so far
#MySubMidData2 = pd.concat([MySubMidData1, MyDateData12, MyFloatData1, MyDataZip], axis = 1, join='outer', ignore_index=False)

In [None]:
#Splitting the VIZIENT_MSDRG column into VIZIENT_MSDRG_CODE and VIZIENT_MSDRG_CODE_DESCRIPTION
VIZIENT_MSDRG_C_AND_DESC = MyStandardData.loc[:, ('VIZIENT_MSDRG')].str.split(':', expand=True)
MyStandardData['VIZIENT_MSDRG_CODE'] = VIZIENT_MSDRG_C_AND_DESC[0]
MyStandardData['VIZIENT_MSDRG_CODE_DESCRIPTION'] = VIZIENT_MSDRG_C_AND_DESC[1]

display(MyStandardData.head())

In [None]:
MyStandardData.info()

In [None]:
#Dropping the columns that were created for calculation purposes, and storing the data in a new dataframe
#I do not want to overwrite the standardized dataset, to avoid potential errors when I run this cell again
MyStandardData21 = MyStandardData.drop(columns=['VIZIENT_SUB_SERVICELINE', 'VIZIENT_MSDRG', 'ADMISSIONYEAR', 'ADMISSIONMONTH', 'ADMISSIONDAY', 'BIRTHYEAR', 'BIRTHMONTH', 'BIRTHDAY', 'MonthNotPass', 'MonthAndDayNotPass'])
MyStandardData21.info()

MyStandardData21.loc[0:21, ['VIZ_SERVICELINE', 'VIZ_SUB_SERVICELINE']]

In [None]:
#Creating dummy variables
#Before I do that, let me check the categorical variables I have, and their unique values
for col in ['ETHNICITY', 'SEX', 'RACE', 'SECTION', 'VIZ_SERVICELINE']:
    print(col + ' column unique values')
    display(MyStandardData21[col].unique())
    print('\n')

In [None]:
#As I can see from the above unique values: 
#I am not going to create dummy variables for the VIZ_SERVICELINE column as it would add too many columns
#I will replace 'Hispanic Unknown' with 'Unknown' for convenience
#I will replace 'Non Hispanic' with 'Non_Hispanic' for convenience
#I will replace 'Unavailable' in the RACE column with 'Unknown' for convenience 

MyStandardData21.loc[:, ('ETHNICITY')] = MyStandardData21.loc[:, ('ETHNICITY')].str.replace('Hispanic Unknown', 'Unknown', regex=False)
MyStandardData21.loc[:, ('ETHNICITY')] = MyStandardData21.loc[:, ('ETHNICITY')].str.replace('Non Hispanic', 'Non_Hispanic', regex=False)
MyStandardData21.loc[:, ('RACE')] = MyStandardData21.loc[:, ('RACE')].str.replace('Unavailable', 'Unknown', regex=False)


#Let's see the unique values in the categorical variables now
for col in ['ETHNICITY', 'SEX', 'RACE', 'SECTION']:
    print(col + ' column unique values')
    display(MyStandardData21[col].unique())
    print('\n')

In [None]:
#Let's drop rows with missing values rather than impute them, a cautious choice for clinical data
MyStandardData3 = MyStandardData21.dropna()

#Let's see how many records we dropped due to missing values 
print('Before dropping missing values:')
MyStandardData21.info()
print('\n')

print('After dropping missing values:')
MyStandardData3.info()
print('\n')

print('Dropped records:', len(MyStandardData21) - len(MyStandardData3)) #rows before dropping minus rows after (780842 - 773332 in the original run)
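
Before dropping rows, it can help to see which columns are responsible for the loss. A minimal sketch on hypothetical toy data, counting missing values per column and deriving the number of dropped rows from the DataFrames themselves:

```python
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    'SEX': ['F', None, 'M', 'F'],
    'RACE': ['White', 'Asian', None, 'Black'],
    'AGE': [34, 51, 47, 29],
})

# Missing values per column, to see which columns drive the row loss
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
dropped = len(toy) - len(cleaned)

print(missing_per_column)
print('Dropped records:', dropped)
```

If a single column accounts for most of the missing values, `dropna(subset=[...])` restricted to the columns needed for the analysis would be one way to keep more rows.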

In [None]:
#First let's see what our categorical variables look like
display(MyStandardData3.loc[0:10, ['SEX', 'ETHNICITY', 'RACE', 'SECTION']])

In [None]:
#Let's get the Dummies now -- Yayy!
MyStandardData4 = MyStandardData3.copy()

MyStandardData4 = pd.get_dummies(MyStandardData4, columns = ['SEX', 'ETHNICITY', 'RACE', 'SECTION'])

display(MyStandardData4)
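
For regression-style analyses, keeping a dummy column for every level of a variable makes the columns perfectly collinear. One common option, sketched here on hypothetical toy data, is to drop the first level of each variable and request integer 0/1 columns. Whether this suits the intended analysis is a judgment call, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column
toy = pd.DataFrame({
    'SEX': ['F', 'M', 'F'],
    'ETHNICITY': ['Hispanic', 'Non_Hispanic', 'Unknown'],
    'AGE': [34, 51, 47],
})

# drop_first=True omits one level per variable to avoid perfectly collinear columns
# dtype=int gives 0/1 values instead of booleans
dummies = pd.get_dummies(toy, columns=['SEX', 'ETHNICITY'], drop_first=True, dtype=int)
print(dummies)
```

The omitted level of each variable becomes the reference category against which the remaining dummy columns are interpreted.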

In [None]:
display(MyStandardData4.iloc[0:11, 14:])

In [None]:
#Alright, let's describe the Dataset
display(MyStandardData4.describe())