# Read and store the data set

The data set description and variable documentation are available from the UCI Machine Learning Repository:

[SUPPORT2 data set (UCI Machine Learning Repository)](https://archive.ics.uci.edu/dataset/880/support2)

| Variable Name | Role      | Type        | Description |
|--------------|----------|------------|-------------|
| **id**       | ID       | Integer    | Patient ID |
| **age**      | Feature  | Continuous | Age of the patient in years |
| **death**    | Target   | Continuous | Death at any time up to National Death Index (NDI) data on Dec 31, 1994. Some patients are discharged before the end of the study and are not followed up. |
| **sex**      | Feature  | Categorical | Gender of the patient (male, female) |
| **hospdead** | Target   | Binary     | Death in hospital |
| **slos**     | Other    | Continuous | Days from Study Entry to Discharge |
| **d.time**   | Other    | Continuous | Days of follow-up |
| **dzgroup**  | Feature  | Categorical | Patient’s disease subcategory among ARF/MOSF w/Sepsis, CHF, COPD, Cirrhosis, Colon Cancer, Coma, Lung Cancer, MOSF w/Malig. |
| **dzclass**  | Feature  | Categorical | Disease category: ARF/MOSF, COPD/CHF/Cirrhosis, Cancer, Coma |
| **num.co**   | Feature  | Continuous | Number of comorbidities, ordinal (higher values = worse condition) |
| **edu**      | Feature  | Categorical | Years of education (missing values) |
| **income**   | Feature  | Categorical | Patient income: {"$11-$25k", "$25-$50k", ">$50k", "under $11k"} (missing values) |
| **scoma**    | Feature  | Continuous | SUPPORT day 3 Coma Score (Glasgow scale, predicted) |
| **charges**  | Feature  | Continuous | Hospital charges |
| **totcst**   | Feature  | Continuous | Total ratio of costs to charges (RCC) cost |
| **totmcst**  | Feature  | Continuous | Total micro cost |
| **avtisst**  | Feature  | Continuous | Average TISS score (days 3-25, ICU cost calculation) |
| **race**     | Feature  | Categorical | Race: {asian, black, hispanic, missing, other, white} (missing values) |
| **sps**      | Feature  | Continuous | SUPPORT physiology score on day 3 (predicted) |
| **aps**      | Feature  | Continuous | APACHE III day 3 physiology score |
| **surv2m**   | Feature  | Continuous | SUPPORT model 2-month survival estimate at day 3 (predicted) |
| **surv6m**   | Feature  | Continuous | SUPPORT model 6-month survival estimate at day 3 (predicted) |
| **hday**     | Feature  | Integer    | Day in hospital at which patient entered study |
| **diabetes** | Feature  | Continuous | Patient has diabetes (Yes/No) |
| **dementia** | Feature  | Continuous | Patient has dementia (Yes/No) |
| **ca**       | Feature  | Categorical | Cancer status: yes, metastatic, no |
| **prg2m**    | Feature  | Continuous | Physician’s 2-month survival estimate (missing values) |
| **prg6m**    | Feature  | Categorical | Physician’s 6-month survival estimate (missing values) |
| **dnr**      | Feature  | Categorical | Do Not Resuscitate (DNR) order status (missing values) |
| **dnrday**   | Feature  | Continuous | Day of DNR order (<0 if before study) (missing values) |
| **meanbp**   | Feature  | Continuous | Mean arterial blood pressure at day 3 |
| **wblc**     | Feature  | Continuous | White blood cell count (day 3) |
| **hrt**      | Feature  | Continuous | Heart rate (day 3) |
| **resp**     | Feature  | Continuous | Respiration rate (day 3) |
| **temp**     | Feature  | Continuous | Temperature in Celsius (day 3) |
| **pafi**     | Feature  | Continuous | $PaO_2/FiO_2$ ratio (hypoxaemia indicator, day 3) |
| **alb**      | Feature  | Continuous | Serum albumin levels (day 3) |
| **bili**     | Feature  | Continuous | Bilirubin levels (day 3) |
| **crea**     | Feature  | Continuous | Serum creatinine levels (day 3) |
| **sod**      | Feature  | Continuous | Serum sodium concentration (day 3) |
| **ph**       | Feature  | Continuous | Arterial blood pH (day 3) |
| **glucose**  | Feature  | Integer    | Glucose levels (day 3) |
| **bun**      | Feature  | Integer    | Blood urea nitrogen levels (day 3) |
| **urine**    | Feature  | Integer    | Urine output (day 3) |
| **adlp**     | Feature  | Categorical | Index of Activities of Daily Living (ADL) filled out by patient (day 3) |
| **adls**     | Feature  | Continuous | Index of Activities of Daily Living (ADL) filled out by a surrogate (day 3) |
| **sfdm2**    | Target   | Categorical | Level of functional disability (1-5 scale, SIP questionnaire) (missing values) |
| **adlsc**    | Feature  | Continuous | Imputed ADL Calibrated to Surrogate |

In [1]:
from ucimlrepo import fetch_ucirepo 
# scientific computing libraries
import pandas as pd
import numpy as np

In [2]:
# import custom helper functions from the local DataStats module
from DataStats import distribution_stats, numerical_correlation

In [None]:
# fetch dataset from UCI repository
support2 = fetch_ucirepo(id=880) 
  
# data (as pandas dataframes) 
data = support2.data['original']

data.info()

In [9]:
# fetch dataset from local directory
data = pd.read_csv('../data/original_data.csv').drop('Unnamed: 0', axis=1)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9105 entries, 0 to 9104
Data columns (total 48 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        9105 non-null   int64  
 1   age       9105 non-null   float64
 2   death     9105 non-null   int64  
 3   sex       9105 non-null   object 
 4   hospdead  9105 non-null   int64  
 5   slos      9105 non-null   int64  
 6   d.time    9105 non-null   int64  
 7   dzgroup   9105 non-null   object 
 8   dzclass   9105 non-null   object 
 9   num.co    9105 non-null   int64  
 10  edu       7471 non-null   float64
 11  income    6123 non-null   object 
 12  scoma     9104 non-null   float64
 13  charges   8933 non-null   float64
 14  totcst    8217 non-null   float64
 15  totmcst   5630 non-null   float64
 16  avtisst   9023 non-null   float64
 17  race      9063 non-null   object 
 18  sps       9104 non-null   float64
 19  aps       9104 non-null   float64
 20  surv2m    9104 non-null   floa
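The `Non-Null Count` column above already shows that several variables are incomplete (for example, `totmcst` has 5630 non-null values out of 9105 entries, and `income` has 6123). A compact way to rank columns by missingness is sketched below. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the notebook's `DataStats` module; in the notebook the function would be called on `data`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    # Keep only the columns that have at least one missing value
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for `data`; the values are made up for illustration only
toy_missing = pd.DataFrame({
    "age": [62.0, 71.5, 48.2, 80.1],
    "edu": [12.0, np.nan, 16.0, np.nan],
    "income": ["under $11k", None, None, None],
})
missing_table = missing_summary(toy_missing)
print(missing_table)
```

Columns with a large share of missing values are candidates for imputation or exclusion before modelling.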

In [10]:
numerical_correlation(data)


Unnamed: 0,id,age,death,hospdead,slos,d.time,num.co,edu,scoma,charges,totcst,totmcst,avtisst,sps,aps,surv2m,surv6m,hday,diabetes,dementia,prg2m,prg6m,dnrday,meanbp,wblc,hrt,resp,temp,pafi,alb,bili,crea,sod,ph,glucose,bun,urine,adlp,adls,adlsc
id,1.0,0.037366,-0.045849,-0.071871,-0.022347,-0.053008,0.088782,-0.250794,-0.005346,-0.242265,-0.15267,-0.167094,-0.030589,-0.079378,-0.060142,0.051146,0.045769,-0.039508,0.04895,-0.016523,0.011055,0.024512,-0.036303,0.123582,0.032289,-0.016056,0.023704,-0.092149,-0.013898,0.039315,-0.16364,-0.051702,-0.045403,-0.020055,-0.053758,-0.115053,-0.003499,-0.00209,-0.001007,-0.050149
age,0.037366,1.0,0.17923,0.039354,-0.085518,-0.134531,0.127986,-0.135537,0.006557,-0.167864,-0.155399,-0.134466,-0.133693,-0.115444,-0.032576,-0.158056,-0.197112,-0.071923,0.095958,0.195303,-0.084631,-0.131794,-0.156809,-0.040134,0.004311,-0.126307,-0.021644,-0.085966,0.017296,0.082836,-0.132711,0.00191,0.01398,-0.020398,-0.009136,0.023265,-0.045094,0.043799,0.096926,0.115379
death,-0.045849,0.17923,1.0,0.404793,-0.083098,-0.710489,0.068634,-0.002943,0.135986,-0.014402,-0.022615,-0.020465,0.117293,0.156346,0.156866,-0.325049,-0.378247,0.063845,0.01194,0.044391,-0.306923,-0.383796,-0.146655,-0.031753,-0.004825,-0.003841,0.006331,-0.029459,0.002258,0.014215,0.029658,0.026705,-0.018409,-0.008592,0.002928,0.019307,-0.004537,0.107614,0.146124,0.15476
hospdead,-0.071871,0.039354,0.404793,1.0,-0.010579,-0.486616,-0.077746,0.015941,0.379582,0.187214,0.191738,0.176864,0.552508,0.462578,0.475855,-0.558111,-0.498079,0.213804,-0.013186,0.025831,-0.503983,-0.430151,-0.055778,-0.087594,0.069621,0.095483,0.020777,0.069972,-0.103693,-0.143165,0.167734,0.098724,0.029903,-0.038199,0.012128,0.036711,-0.016647,0.099518,0.090047,0.12635
slos,-0.022347,-0.085518,-0.083098,-0.010579,1.0,0.096903,-0.110978,0.031547,0.037513,0.641403,0.772046,0.768692,0.29364,0.113168,0.150851,-0.047855,-0.016808,0.204155,0.006075,-0.01082,-0.037096,0.021267,0.882923,0.006807,0.066488,0.09168,-0.00018,0.102539,-0.057788,-0.101903,0.027496,0.044876,0.033885,0.03454,0.008136,0.012597,0.023694,0.105861,0.0297,0.019143
d.time,-0.053008,-0.134531,-0.710489,-0.486616,0.096903,1.0,-0.045936,-0.005906,-0.20366,-0.035696,-0.010532,-0.052847,-0.225478,-0.231486,-0.219636,0.392719,0.429069,-0.089718,-0.00812,-0.054488,0.387729,0.428419,0.158174,0.057194,-0.039042,-0.039514,-0.002243,0.008446,0.05791,0.064473,-0.072876,-0.046206,0.000313,0.022283,-0.002414,-0.019703,0.018035,-0.087903,-0.123983,-0.136521
num.co,0.088782,0.127986,0.068634,-0.077746,-0.110978,-0.045936,1.0,-0.109775,-0.126215,-0.108838,-0.147501,-0.144888,-0.164272,-0.048492,0.01647,0.09958,0.086865,-0.080627,0.387569,0.139526,0.063745,0.030098,-0.116699,-0.016713,-0.010753,-0.067751,0.011633,-0.113781,0.074331,0.056395,0.000581,0.033319,-0.025437,-0.020673,0.007348,0.03055,-0.033966,0.077699,0.142791,0.142836
edu,-0.250794,-0.135537,-0.002943,0.015941,0.031547,-0.005906,-0.109775,1.0,-0.000482,0.112803,0.090229,0.099033,0.024131,0.025665,0.001937,-0.02378,-0.03092,0.040076,-0.075112,-0.013413,0.013826,-0.000537,0.036512,-0.033463,-0.007368,-0.003417,-0.025647,0.028365,-0.02856,-0.024726,0.057199,0.008913,-0.005265,0.022757,0.011323,0.026304,0.015514,-0.088278,-0.0812,-0.067644
scoma,-0.005346,0.006557,0.135986,0.379582,0.037513,-0.20366,-0.126215,-0.000482,1.0,0.137029,0.116443,0.096084,0.313402,0.278437,0.279056,-0.590313,-0.499572,0.116485,-0.001114,0.072192,-0.404672,-0.31801,-0.006522,-0.03711,0.079583,0.026471,-0.000831,0.09815,-0.028703,-0.06546,0.107912,0.068842,0.081112,-0.002444,0.027356,0.021288,0.005098,0.042039,0.074462,0.126339
charges,-0.242265,-0.167864,-0.014402,0.187214,0.641403,-0.035696,-0.108838,0.112803,0.137029,1.0,0.871896,0.814307,0.449707,0.264415,0.311941,-0.210029,-0.162642,0.476965,-0.023669,-0.045827,-0.131908,-0.050267,0.620366,-0.046848,0.042278,0.129735,0.016292,0.127782,-0.08139,-0.099543,0.216148,0.090203,0.051826,0.042832,0.06565,0.089898,0.030887,0.04898,-0.013853,0.008519
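The full matrix is hard to scan by eye. Among the rows shown, a few pairs stand out, such as `slos` and `dnrday` (0.88), `charges` and `totcst` (0.87), and `death` and `d.time` (-0.71). The sketch below lists such pairs automatically. It is not the implementation of `numerical_correlation`, which lives in the custom `DataStats` module and is not shown here; the function name `top_correlations`, the threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """List variable pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().rename("r").reset_index()
    pairs.columns = ["var_1", "var_2", "r"]
    strong = pairs[pairs["r"].abs() >= threshold]
    # Sort by strength of association, regardless of sign
    strong = strong.sort_values("r", key=lambda s: s.abs(), ascending=False)
    return strong.reset_index(drop=True)


# Toy stand-in for `data`; the values are made up for illustration only
toy_corr = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("xyzxy"),  # non-numeric columns are ignored
})
strong_pairs = top_correlations(toy_corr, threshold=0.9)
print(strong_pairs)
```

In the notebook, `top_correlations(data)` would return the strongly correlated pairs, which helps flag redundant features such as the cost variables.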


# EDA

Check the distribution of these numerical variables.

In [11]:
descriptive_stats = distribution_stats(data)

There is at least one variable that is highly skewed
There is at least one variable that has more outliers
There is no vairable that is normally distributed




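1. Skewness:
   - If skewness is close to 0, the distribution is approximately symmetric (a property of a normal distribution).
   - Large positive skewness (> 1) means the distribution has a long right tail.
   - Large negative skewness (< -1) means the distribution has a long left tail.

2. Kurtosis:
   - Measures whether the distribution has heavier or lighter tails than a normal distribution.
   - Under Pearson's definition a normal distribution has kurtosis 3. The `kurtosis` row reported below is on the excess (Fisher) scale, which subtracts 3, as the value of -1.2 for the evenly spaced `id` column shows. The reference value for those numbers is therefore 0.
   - Excess kurtosis > 0 (Pearson > 3) → Heavy tails (leptokurtic, more outliers).
   - Excess kurtosis < 0 (Pearson < 3) → Light tails (platykurtic, fewer outliers).
   - Excess kurtosis = 0 (Pearson = 3) → Same tail weight as a normal distribution.

3. Shapiro-Wilk test:
   - Tests the null hypothesis that the data are normally distributed.
   - Returns a p-value:
     - p > 0.05 → Normality is not rejected; the data are consistent with a normal distribution.
     - p ≤ 0.05 → Normality is rejected.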
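The three measures above can be computed with `scipy.stats`, as in the sketch below. It is an illustration under stated assumptions, not the code of `distribution_stats` (that helper is in the custom `DataStats` module and is not shown). The function name `shape_stats`, the subsample size, and the synthetic data are hypothetical. Two details are worth noting: `scipy.stats.kurtosis` returns excess kurtosis by default (`fisher=True`, normal = 0) and Pearson kurtosis with `fisher=False` (normal = 3), and SciPy documents that Shapiro-Wilk p-values may be inaccurate for more than 5000 observations, so the sketch tests a random subsample of larger columns.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shape_stats(series: pd.Series, max_shapiro_n: int = 5000, seed: int = 0) -> dict:
    """Skewness, kurtosis (both conventions) and Shapiro-Wilk p-value for one column."""
    x = series.dropna().to_numpy(dtype=float)
    # Shapiro-Wilk p-values may be inaccurate for large samples: test a subsample
    if len(x) > max_shapiro_n:
        rng = np.random.default_rng(seed)
        x_test = rng.choice(x, size=max_shapiro_n, replace=False)
    else:
        x_test = x
    _, p_value = stats.shapiro(x_test)
    return {
        "skewness": stats.skew(x),
        "excess_kurtosis": stats.kurtosis(x, fisher=True),    # normal -> 0
        "pearson_kurtosis": stats.kurtosis(x, fisher=False),  # normal -> 3
        "shapiro_p": p_value,
    }


# Synthetic right-skewed sample (exponential), for illustration only
rng = np.random.default_rng(42)
toy_skewed = pd.Series(rng.exponential(scale=10.0, size=500))
skewed_stats = shape_stats(toy_skewed)
print(skewed_stats)

# Evenly spaced integers, like the `id` column: excess kurtosis is about -1.2
id_kurtosis = stats.kurtosis(np.arange(1, 9106))
print(round(id_kurtosis, 3))
```

The last line reproduces the -1.2 reported for `id` in the descriptive statistics, which is why those kurtosis values should be compared against 0 rather than 3.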

In [7]:
descriptive_stats

Unnamed: 0.1,Unnamed: 0,id,age,death,hospdead,slos,d.time,num.co,edu,scoma,...,bili,crea,sod,ph,glucose,bun,urine,adlp,adls,adlsc
count,9105.0,9105.0,9105.0,9105.0,9105.0,9105.0,9105.0,9105.0,7471.0,9104.0,...,6504.0,9038.0,9104.0,6821.0,4605.0,4753.0,4243.0,3464.0,6238.0,9105.0
mean,4552.0,4553.0,62.65082,0.681054,0.259198,17.863042,478.449863,1.868644,11.74769,12.058546,...,2.554463,1.770961,137.5685,7.415364,159.873398,32.349463,2191.546,1.15791,1.637384,1.888272
std,2628.531434,2628.531434,15.59371,0.466094,0.438219,22.00644,560.383272,1.344409,3.447743,24.636694,...,5.318448,1.686041,6.029326,0.0805635,88.391541,26.792288,1455.246,1.739672,2.231358,2.003763
min,0.0,1.0,18.04199,0.0,0.0,3.0,3.0,0.0,0.0,0.0,...,0.099991,0.099991,110.0,6.829102,0.0,1.0,0.0,0.0,0.0,0.0
25%,2276.0,2277.0,52.797,0.0,0.0,6.0,26.0,1.0,10.0,0.0,...,0.5,0.899902,134.0,7.379883,103.0,14.0,1165.5,0.0,0.0,0.0
50%,4552.0,4553.0,64.85699,1.0,0.0,11.0,233.0,2.0,12.0,0.0,...,0.899902,1.199951,137.0,7.419922,135.0,23.0,1968.0,0.0,1.0,1.0
75%,6828.0,6829.0,73.99896,1.0,1.0,20.0,761.0,3.0,14.0,9.0,...,1.899902,1.899902,141.0,7.469727,188.0,42.0,3000.0,2.0,3.0,3.0
max,9104.0,9105.0,101.848,1.0,1.0,343.0,2029.0,9.0,31.0,100.0,...,63.0,21.5,181.0,7.769531,1092.0,300.0,9000.0,7.0,7.0,7.073242
skewness,0.0,0.0,-0.5021164,-0.777072,1.099244,4.624615,1.199262,0.823294,-0.05826728,2.333585,...,4.817305,3.225092,0.3573931,-1.028938,2.577963,1.880382,0.9720198,1.693219,1.201637,0.937715
kurtosis,-1.2,-1.2,-0.1659246,-1.396358,-0.79206,35.079345,0.358316,0.643607,1.495601,4.845473,...,28.647885,14.297583,1.602005,3.33512,12.722037,5.240971,1.316135,2.015955,0.042767,-0.066667


Visualize their distributions.
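One way to do this is a grid of histograms, one panel per numerical variable. The sketch below uses `matplotlib`; the function name `plot_histograms`, the grid layout, and the synthetic data are assumptions for illustration. In the notebook the function would be called on `data` with a list of its numerical columns.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histograms(df: pd.DataFrame, columns: list, ncols: int = 3, bins: int = 30):
    """Draw one histogram per column on a grid and return the figure."""
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        # Missing values are dropped before binning
        ax.hist(df[col].dropna(), bins=bins)
        ax.set_title(col)
    # Hide unused panels in the last row
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Synthetic stand-in for `data`: one roughly symmetric and one right-skewed column
rng = np.random.default_rng(0)
toy_plot = pd.DataFrame({
    "symmetric": rng.normal(loc=60.0, scale=15.0, size=300),
    "right_skewed": rng.exponential(scale=20.0, size=300),
})
fig = plot_histograms(toy_plot, ["symmetric", "right_skewed"])
```

For strongly right-skewed variables, plotting on a log scale or after a log transform can make the bulk of the distribution easier to see.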