# Applied Healthcare Analytics

# Lecture 2 - Data Wrangling (CKD Data)

Workshop Instructions: 

1. Read through the text descriptions at the top of the cell/code blocks
2. Run the code below by selecting the code block and pressing ``Ctrl + Enter``. Note: The preceding code blocks must be run before proceeding to the next block.
2. Think through the guiding questions and points that are raised for the step. What do you observe from the output and what do they mean?

In [1]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import os as os

# Open Sourced Datasets

The datasets in this notebook can be obtained from the **UCI Machine Learning Repository**

1. Chronic Kidney Disease Dataset: https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease

**References**:

1. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

# 1. Read in datasets

In [7]:
data_dir = 'C:\\_Work_Folder\\SMU\\MITB_HealthcareAnalytics\\Lectures\\Datasets\\CKD\\'
os.chdir(data_dir)
ckd = pd.read_csv('chronic_kidney_disease_full_unclean.csv')

# 2. Data Wrangling and Exploration

The first step in any data analysis process is inspecting the dataset itself and familiarizing with the data format and presentation. Use the `.head()`, `.column()`, `.info()`, `.describe()` etc methods on the DataFrames to have a look at the values in the columns and how they are presented. 

Key Considerations:

- Are there any missing values? How do you want to treat cases with these values?
- Are there duplicated rows?
- Is the data in a tidy format?
- Which is your target variable, or variable that you are interesting in predicting?
- Are there any outliers? What are the descriptives for each column?
- Need to rescale, recode categorical, aggregate, etc?
- Is the dataset balanced i.e. equal proportions of examples from each class label?

In [3]:
ckd

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48,80,1.02,1,0,?,normal,notpresent,notpresent,121,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,7,50,1.02,4,0,?,normal,notpresent,notpresent,?,...,38,6000,?,no,no,no,good,no,no,ckd
2,62,80,1.01,2,3,normal,normal,notpresent,notpresent,423,...,31,7500,?,no,yes,no,poor,no,yes,ckd
3,48,70,1.005,4,0,normal,abnormal,present,notpresent,117,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,51,80,1.01,2,0,normal,normal,notpresent,notpresent,106,...,35,7300,4.6,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403,52,80,1.025,0,0,normal,normal,notpresent,notpresent,99,...,52,6300,5.3,no,no,no,good,no,no,notckd
404,36,80,1.025,0,0,normal,normal,notpresent,notpresent,85,...,44,5800,6.3,no,no,no,good,no,no,notckd
405,57,80,1.02,0,0,normal,normal,notpresent,notpresent,133,...,46,6600,5.5,no,no,no,good,no,no,notckd
406,43,60,1.025,0,0,normal,normal,notpresent,notpresent,117,...,54,7400,5.4,no,no,no,good,no,no,notckd


In [7]:
ckd.describe()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
count,408,408,408.0,408,408,408,408,408,408,408,...,408,408,408,408,408,408,408,408,408,408
unique,77,11,6.0,7,7,3,3,3,3,147,...,43,90,46,3,3,3,3,3,3,2
top,60,80,1.02,0,0,normal,normal,notpresent,notpresent,?,...,?,?,?,no,no,no,good,no,no,ckd
freq,19,122,110.0,207,298,209,267,362,382,44,...,71,106,131,259,269,372,325,331,347,250


## 2.1 Check and Remove rows that are complete duplicates

In [8]:
ckd[ckd.duplicated(subset=None)].sort_values(by=['age'])

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
401,12,80,1.02,0,0,normal,normal,notpresent,notpresent,100,...,49,6600,5.4,no,no,no,good,no,no,notckd
402,17,60,1.025,0,0,normal,normal,notpresent,notpresent,114,...,51,7200,5.9,no,no,no,good,no,no,notckd
404,36,80,1.025,0,0,normal,normal,notpresent,notpresent,85,...,44,5800,6.3,no,no,no,good,no,no,notckd
406,43,60,1.025,0,0,normal,normal,notpresent,notpresent,117,...,54,7400,5.4,no,no,no,good,no,no,notckd
407,50,80,1.02,0,0,normal,normal,notpresent,notpresent,137,...,45,9500,4.6,no,no,no,good,no,no,notckd
403,52,80,1.025,0,0,normal,normal,notpresent,notpresent,99,...,52,6300,5.3,no,no,no,good,no,no,notckd
400,57,80,1.02,0,0,normal,normal,notpresent,notpresent,133,...,46,6600,5.5,no,no,no,good,no,no,notckd
405,57,80,1.02,0,0,normal,normal,notpresent,notpresent,133,...,46,6600,5.5,no,no,no,good,no,no,notckd


In [4]:
ckd.drop_duplicates(inplace=True)

In [8]:
# Data Profiler is useful to understand the data

file = ProfileReport(ckd)
file.to_file(output_file='output.html')

Summarize dataset: 100%|██████████| 38/38 [00:07<00:00,  5.30it/s, Completed]                     
Generate report structure: 100%|██████████| 1/1 [00:05<00:00,  5.39s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.18it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 251.53it/s]


## 2.2 Check Data Types and Handle Missing Values

Missing values are common in healthcare datasets, where certain entries are missing or are not recorded for various reasons e.g. test was not done. These values can present themselves in the raw dataset in various ways: NA, NIL, ?, None...

As in many areas of analytics, there are multiple ways to deal with missing data

### 2.2.1 Replace with 'NaN' value

You can choose to replace these values with an `NaN` value, which is used by both Pandas and Numpy packages or a string value such as 'NA'. However, note that replacing numeric columns with a string will cause errors and incompatibility fitting these values into machine learning models since it changes the data format and type.

In this example, we will change the missing values: '?' to a `NaN` value. 

In [5]:
ckd.replace('?',np.nan,inplace=True)

### 2.2.2 Further Data Understanding

You should seek to understand the data types of each column and there missingness. Decisions made now on how to deal with these will affect any analytical computations downstream

In [11]:
file = ProfileReport(ckd)
file.to_file(output_file='output.html')
# Note that numeric and categorical columns are detected differently now. We have now identified the data types of the columns using the Data Profiler (13 numeric, 7 categorical and 5 boolean)

Summarize dataset: 100%|██████████| 209/209 [03:19<00:00,  1.05it/s, Completed]                   
Generate report structure: 100%|██████████| 1/1 [00:05<00:00,  5.30s/it]
Render HTML: 100%|██████████| 1/1 [00:06<00:00,  6.31s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 29.44it/s]


In [6]:
pd.set_option('display.max_columns', None)
ckd.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48,80,1.02,1,0,,normal,notpresent,notpresent,121.0,36,1.2,,,15.4,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,7,50,1.02,4,0,,normal,notpresent,notpresent,,18,0.8,,,11.3,38,6000,,no,no,no,good,no,no,ckd
2,62,80,1.01,2,3,normal,normal,notpresent,notpresent,423.0,53,1.8,,,9.6,31,7500,,no,yes,no,poor,no,yes,ckd
3,48,70,1.005,4,0,normal,abnormal,present,notpresent,117.0,56,3.8,111.0,2.5,11.2,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,51,80,1.01,2,0,normal,normal,notpresent,notpresent,106.0,26,1.4,,,11.6,35,7300,4.6,no,no,no,good,no,no,ckd


In [7]:
# based on insights from the Data Profiler, we hypothesize the data types for each columns are as follows
categorical_cols = ['rbc','pc','pcc','ba','appet']
boolean_cols=['htn','dm','cad','pe','ane']
numeric_cols = ['age','bp','sg','al','su','bgr','bu','sc','sod','pot','hemo','pcv','wbcc','rbcc']

In [91]:
# Confirm the categories in each of the categorical and boolean cols
print('rbc - categorical')
print(ckd['rbc'].value_counts())
print('pc - categorical')
print(ckd['pc'].value_counts())
print('pcc - categorical')
print(ckd['pcc'].value_counts())
print('ba - categorical')
print(ckd['ba'].value_counts())
print('appet - categorical')
print(ckd['appet'].value_counts())

rbc - categorical
normal      209
abnormal     47
Name: rbc, dtype: int64
pc - categorical
normal      267
abnormal     76
Name: pc, dtype: int64
pcc - categorical
notpresent    362
present        42
Name: pcc, dtype: int64
ba - categorical
notpresent    382
present        22
Name: ba, dtype: int64
appet - categorical
good    325
poor     82
Name: appet, dtype: int64


In [21]:
print('htn - boolean')
print(ckd['htn'].value_counts())
print('dm - boolean')
print(ckd['dm'].value_counts())
print('cad - boolean')
print(ckd['cad'].value_counts())
print('pe - boolean')
print(ckd['pe'].value_counts())
print('ane - boolean')
print(ckd['ane'].value_counts())

htn - boolean
no     259
yes    147
Name: htn, dtype: int64
dm - boolean
no     269
yes    137
Name: dm, dtype: int64
cad - boolean
no     372
yes     34
Name: cad, dtype: int64
pe - boolean
no     331
yes     76
Name: pe, dtype: int64
ane - boolean
no     347
yes     60
Name: ane, dtype: int64


In [23]:
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     399 non-null    object
 1   bp      396 non-null    object
 2   sg      361 non-null    object
 3   al      362 non-null    object
 4   su      359 non-null    object
 5   rbc     256 non-null    object
 6   pc      343 non-null    object
 7   pcc     404 non-null    object
 8   ba      404 non-null    object
 9   bgr     364 non-null    object
 10  bu      389 non-null    object
 11  sc      391 non-null    object
 12  sod     321 non-null    object
 13  pot     320 non-null    object
 14  hemo    356 non-null    object
 15  pcv     337 non-null    object
 16  wbcc    302 non-null    object
 17  rbcc    277 non-null    object
 18  htn     406 non-null    object
 19  dm      406 non-null    object
 20  cad     406 non-null    object
 21  appet   407 non-null    object
 22  pe      407 non-null    ob

In [8]:
# As all the columns are of string types, we need to change them to the correct data types
ckd[numeric_cols]=ckd[numeric_cols].astype('float')
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     399 non-null    float64
 1   bp      396 non-null    float64
 2   sg      361 non-null    float64
 3   al      362 non-null    float64
 4   su      359 non-null    float64
 5   rbc     256 non-null    object 
 6   pc      343 non-null    object 
 7   pcc     404 non-null    object 
 8   ba      404 non-null    object 
 9   bgr     364 non-null    float64
 10  bu      389 non-null    float64
 11  sc      391 non-null    float64
 12  sod     321 non-null    float64
 13  pot     320 non-null    float64
 14  hemo    356 non-null    float64
 15  pcv     337 non-null    float64
 16  wbcc    302 non-null    float64
 17  rbcc    277 non-null    float64
 18  htn     406 non-null    object 
 19  dm      406 non-null    object 
 20  cad     406 non-null    object 
 21  appet   407 non-null    object 
 22  pe

In [93]:
ckd['bgr'].mean()

147.30494505494505

In [9]:
ckd['bgr_na']=ckd['bgr']

In [10]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(ckd[numeric_cols])
ckd[numeric_cols] = imp.transform(ckd[numeric_cols])
ckd[numeric_cols] = ckd[numeric_cols].astype(np.float32)

In [11]:
ckd[ckd['bgr_na'].isna()]

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class,bgr_na
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,147.304947,18.0,0.8,137.599686,4.62125,11.3,38.0,6000.0,4.729964,no,no,no,good,no,no,ckd,
21,60.0,90.0,1.017521,0.994475,0.440111,,,notpresent,notpresent,147.304947,180.0,76.0,4.5,4.62125,10.9,32.0,6200.0,3.6,yes,yes,yes,good,no,no,ckd,
23,21.0,70.0,1.01,0.0,0.0,,normal,notpresent,notpresent,147.304947,57.026222,3.028517,137.599686,4.62125,12.574438,39.109791,8368.874023,4.729964,no,no,no,poor,no,yes,ckd,
24,42.0,100.0,1.015,4.0,0.0,normal,abnormal,notpresent,present,147.304947,50.0,1.4,129.0,4.0,11.1,39.0,8300.0,4.6,yes,no,no,poor,no,no,ckd,
29,68.0,70.0,1.005,1.0,0.0,abnormal,abnormal,present,notpresent,147.304947,28.0,1.4,137.599686,4.62125,12.9,38.0,8368.874023,4.729964,no,no,yes,good,no,no,ckd,
38,69.0,80.0,1.02,3.0,0.0,abnormal,normal,notpresent,notpresent,147.304947,103.0,4.1,132.0,5.9,12.5,39.109791,8368.874023,4.729964,yes,no,no,good,no,no,ckd,
41,45.0,70.0,1.01,0.0,0.0,,normal,notpresent,notpresent,147.304947,20.0,0.7,137.599686,4.62125,12.574438,39.109791,8368.874023,4.729964,no,no,no,good,yes,no,ckd,
47,11.0,80.0,1.01,3.0,0.0,,normal,notpresent,notpresent,147.304947,17.0,0.8,137.599686,4.62125,15.0,45.0,8600.0,4.729964,no,no,no,good,no,no,ckd,
52,53.0,90.0,1.015,0.0,0.0,,normal,notpresent,notpresent,147.304947,38.0,2.2,137.599686,4.62125,10.9,34.0,4300.0,3.7,no,no,no,poor,no,yes,ckd,
54,63.0,80.0,1.01,2.0,2.0,normal,,notpresent,notpresent,147.304947,57.026222,3.4,136.0,4.2,13.0,40.0,9800.0,4.2,yes,no,yes,good,no,no,ckd,


In [12]:
ckd.drop(['bgr_na'],axis=1,inplace=True)

Notice how the missing values have been replaced by mean values of the entire column. There are many other methods available for imputation, such as Multi-Variate Imputation (MICE), K-Nearest Neighbours that can be used.

In [13]:
ckd[numeric_cols].describe()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc
count,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0,408.0
mean,51.263153,76.439392,1.017517,0.994475,0.440111,147.304932,57.026215,3.028511,137.599701,4.621252,12.574438,39.109795,8368.871094,4.729967
std,17.032656,13.399882,0.005374,1.267613,1.021226,74.229401,48.904995,5.570064,9.143654,2.792665,2.708144,8.191044,2510.198975,0.841406
min,2.0,50.0,1.005,0.0,0.0,22.0,1.5,0.4,4.5,2.5,3.1,9.0,2200.0,2.1
25%,42.0,70.0,1.015,0.0,0.0,100.75,27.0,0.9,135.0,4.0,10.9,34.0,6900.0,4.5
50%,54.0,80.0,1.017521,0.0,0.0,125.0,44.5,1.3,137.599686,4.62125,12.574438,39.109791,8368.874023,4.729964
75%,64.0,80.0,1.02,2.0,0.440111,148.5,60.0,3.028517,141.0,4.8,14.8,44.0,9325.0,5.2
max,90.0,180.0,1.025,5.0,5.0,490.0,391.0,76.0,163.0,47.0,17.799999,54.0,26400.0,8.0


##  2.3 Recoding categorical and Boolean variables (e.g. One-Hot Encoding and Label Encoding)

Medical datasets often contain categorical variables, and may sometimes have a large number of categories. These variables can indicate the presence of certain conditions e.g. diabetes, hypertension, abnormal results. Before they can be used with certain machine learning models, they have to be transformed into suitable numeric representation, such as **one-hot encoding** or **label encoding**.

**Label Encoding**: Maps the $n$ categories of the variable into integer representations. This is a convenient way to convert the variable but may lead to misleading results, as the variable may be interpreted as a numeric variable instead.

**One-Hot Encoding**: Converts the $n$ categories of the variable in to $n$ new columns, each using a binary value to represent the $nth$ category. This is a common way of transforming categorical variables but may run into dimensionality issues when used with variables with a large number of categories. 

In [14]:
# Label encoding for Boolean variables ('Yes' -> True; 'No' -> False)
ckd[boolean_cols] = np.where(ckd[boolean_cols] == 'Yes',True,False)
ckd[boolean_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   htn     408 non-null    bool 
 1   dm      408 non-null    bool 
 2   cad     408 non-null    bool 
 3   pe      408 non-null    bool 
 4   ane     408 non-null    bool 
dtypes: bool(5)
memory usage: 2.1 KB


In [18]:
# We can also encode Categorical columns using label encoding as well, but the labels are ambiguous e.g., rbc: {NaN:-1, 'abnormal':0, 'normal': 1}
ckd_cat_le=ckd.copy()
ckd_cat_le[categorical_cols]=ckd_cat_le[categorical_cols].astype('category')
for col in categorical_cols:
    ckd_cat_le[col] = ckd_cat_le[col].cat.codes
ckd_cat_le

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.020,1.0,0.0,-1,1,0,0,121.000000,36.0,1.2,137.599686,4.62125,15.4,44.0,7800.0,5.200000,False,False,False,0,False,False,ckd
1,7.0,50.0,1.020,4.0,0.0,-1,1,0,0,147.304947,18.0,0.8,137.599686,4.62125,11.3,38.0,6000.0,4.729964,False,False,False,0,False,False,ckd
2,62.0,80.0,1.010,2.0,3.0,1,1,0,0,423.000000,53.0,1.8,137.599686,4.62125,9.6,31.0,7500.0,4.729964,False,False,False,1,False,False,ckd
3,48.0,70.0,1.005,4.0,0.0,1,0,1,0,117.000000,56.0,3.8,111.000000,2.50000,11.2,32.0,6700.0,3.900000,False,False,False,1,False,False,ckd
4,51.0,80.0,1.010,2.0,0.0,1,1,0,0,106.000000,26.0,1.4,137.599686,4.62125,11.6,35.0,7300.0,4.600000,False,False,False,0,False,False,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403,52.0,80.0,1.025,0.0,0.0,1,1,0,0,99.000000,25.0,0.8,135.000000,3.70000,15.0,52.0,6300.0,5.300000,False,False,False,0,False,False,notckd
404,36.0,80.0,1.025,0.0,0.0,1,1,0,0,85.000000,16.0,1.1,142.000000,4.10000,15.6,44.0,5800.0,6.300000,False,False,False,0,False,False,notckd
405,57.0,80.0,1.020,0.0,0.0,1,1,0,0,133.000000,48.0,1.2,147.000000,4.30000,14.8,46.0,6600.0,5.500000,False,False,False,0,False,False,notckd
406,43.0,60.0,1.025,0.0,0.0,1,1,0,0,117.000000,45.0,0.7,141.000000,4.40000,13.0,54.0,7400.0,5.400000,False,False,False,0,False,False,notckd


In [None]:
# One-hot encoding for Categorical variables. REmember to keep a dummy variable to indicate NaN
for col in categorical_cols:
    one_hot = pd.get_dummies(ckd[col], dummy_na=True)
    # retain a column to indicate NaN
    one_hot.columns=one_hot.columns.fillna('NA')
    one_hot.columns = [col + '_' + a for a in one_hot.columns]
    one_hot.columns
    ckd = ckd.join(one_hot)
ckd

###  2.4 Recoding/ Rescaling continuous variables 
1. Recoding age into categorical
2. Normalization of wbcc
3. Standardization of wbcc


In [20]:
# 1. Recoding age into categorical
# Sometimes needed to study risk factors across different age classes. Pre-elderly is unique class as many chronic diseases are detected in this category. Age classes can be an informative feature as well

def age_groups(series):
    if series < 19:
        return "Children"
    elif 20 <= series < 50:
        return "Adults"
    elif 49 <= series < 64:
        return "Pre-elderly"
    elif 65 <= series:
        return "Elderly"

ckd['Age Group'] = ckd['age'].apply(age_groups)
ckd['Age Group'].value_counts(sort=False)

Elderly         98
Children        21
Adults         139
Pre-elderly    140
Name: Age Group, dtype: int64

In [23]:
ckd['wbcc'].describe()

count      408.000000
mean      8368.871094
std       2510.198975
min       2200.000000
25%       6900.000000
50%       8368.874023
75%       9325.000000
max      26400.000000
Name: wbcc, dtype: float64

In [47]:
# 2. Normalization
# wbcc has a min of 2200, max of 26400. Maybe good to consider standardizing this. 
# Normalization can be used to transform the data within range of 0 to 1. 
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

scaler = MinMaxScaler()
ckd['wbcc_scaled_0_1'] = scaler.fit_transform(ckd[['wbcc']])
print('MinMaxScaler:\n{}'.format(ckd['wbcc_scaled_0_1'].describe()))


# There may be outliers in the original dataset and these may bias the rescaling
# Robust Scaler, which uses the IQR, can be used to normalize the dataset. 
robustscaler = RobustScaler()
ckd['wbcc_robust_scaler'] = robustscaler.fit_transform(ckd[['wbcc']])
print('RobustScaler:\n{}'.format(ckd['wbcc_robust_scaler'].describe()))

# However, sometimes domain knowledge may provide more useful norms 
# The normal number of WBCs in the blood is 4,500 to 11,000 WBCs per microliter (4.5 to 11.0 × 109/L). (https://www.ucsfhealth.org/medical-tests/wbc-count) We can consider standardizing wbcc based on domain knowledge 

ckd['wbcc_domain_scaled'] = (ckd['wbcc']-4500)/(11000-4500)
print('DomainScaler:\n{}'.format(ckd['wbcc_domain_scaled'].describe()))


MinMaxScaler:
count    408.000000
mean       0.254912
std        0.103727
min        0.000000
25%        0.194215
50%        0.254912
75%        0.294421
max        1.000000
Name: wbcc_scaled_0_1, dtype: float64
RobustScaler:
count    4.080000e+02
mean    -7.304491e-09
std      1.035134e+00
min     -2.543866e+00
25%     -6.057212e-01
50%      0.000000e+00
75%      3.942788e-01
max      7.435515e+00
Name: wbcc_robust_scaler, dtype: float64
DomainScaler:
count    408.000000
mean       0.595212
std        0.386185
min       -0.353846
25%        0.369231
50%        0.595211
75%        0.742308
max        3.369231
Name: wbcc_domain_scaled, dtype: float64


In [48]:
# 2. Standardization
# Transforms the features to a standard normal distribution with mean 0 and standard 1
std_scaler = StandardScaler()

ckd['wbcc_z_scaled'] = std_scaler.fit_transform(ckd[['wbcc']])
print('Standard Scaler:\n{}'.format(ckd['wbcc_z_scaled'].describe()))



Standard Scaler:
count    4.080000e+02
mean    -2.644226e-08
std      1.001228e+00
min     -2.460541e+00
25%     -5.858809e-01
50%     -4.391596e-08
75%      3.813641e-01
max      7.191965e+00
Name: wbcc_z_scaled, dtype: float64


## 3. Tidy Data - Melt

In [None]:
path='C:\\_Work_Folder\\SMU\\MITB_HealthcareAnalytics\\Lectures\\Datasets\\WHOChildrenMortality\\'
childmortality=pd.read_excel(path+'WHO Children Under 5 Death Causes - World.xlsx','Sheet1')
childmortality.head()

In [None]:
childmortality=childmortality.melt(id_vars = "year", var_name = "reason", value_name = "count")