<a href="https://colab.research.google.com/github/MarieTKD/support2/blob/main/regression_support2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
pip install ucimlrepo



In [13]:
import pandas as pd
import numpy as np

In [14]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
support2 = fetch_ucirepo(id=880)

# data (as pandas dataframes)
X = support2.data.features
y = support2.data.targets

# metadata
print(support2.metadata)

# variable information
print(support2.variables)


{'uci_id': 880, 'name': 'SUPPORT2', 'repository_url': 'https://archive.ics.uci.edu/dataset/880/support2', 'data_url': 'https://archive.ics.uci.edu/static/public/880/data.csv', 'abstract': "This dataset comprises 9105 individual critically ill patients across 5 United States medical centers, accessioned throughout 1989-1991 and 1992-1994.\nEach row concerns hospitalized patient records who met the inclusion and exclusion criteria for nine disease categories: acute respiratory failure, chronic obstructive pulmonary disease, congestive heart failure, liver disease, coma, colon cancer, lung cancer, multiple organ system failure with malignancy, and multiple organ system failure with sepsis. The goal is to determine these patients' 2- and 6-month survival rates based on several physiologic, demographics, and disease severity information. \nIt is an important problem because it addresses the growing national concern over patients' loss of control near the end of life. It enables earlier deci

In [15]:
y.head()

Unnamed: 0,death,hospdead,sfdm2
0,0,0,
1,1,1,<2 mo. follow-up
2,1,0,<2 mo. follow-up
3,1,0,no(M2 and SIP pres)
4,0,0,no(M2 and SIP pres)


In [16]:
X.head()

Unnamed: 0,age,sex,dzgroup,dzclass,num.co,edu,income,scoma,charges,totcst,...,bili,crea,sod,ph,glucose,bun,urine,adlp,adls,adlsc
0,62.84998,male,Lung Cancer,Cancer,0,11.0,$11-$25k,0.0,9715.0,,...,0.199982,1.199951,141.0,7.459961,,,,7.0,7.0,7.0
1,60.33899,female,Cirrhosis,COPD/CHF/Cirrhosis,2,12.0,$11-$25k,44.0,34496.0,,...,,5.5,132.0,7.25,,,,,1.0,1.0
2,52.74698,female,Cirrhosis,COPD/CHF/Cirrhosis,2,12.0,under $11k,0.0,41094.0,,...,2.199707,2.0,134.0,7.459961,,,,1.0,0.0,0.0
3,42.38498,female,Lung Cancer,Cancer,2,11.0,under $11k,0.0,3075.0,,...,,0.799927,139.0,,,,,0.0,0.0,0.0
4,79.88495,female,ARF/MOSF w/Sepsis,ARF/MOSF,1,,,26.0,50127.0,,...,,0.799927,143.0,7.509766,,,,,2.0,2.0


Among the targets, I will use death.
The doctors who created the dataset gave these recommendations for dealing with missing values:
Baseline variable: normal fill-in value
- Serum albumin (alb): 3.5
- PaO2/FiO2 ratio (pafi): 333.3
- Bilirubin (bili): 1.01
- Creatinine (crea): 1.01
- Blood urea nitrogen (bun): 6.51
- White blood count (wblc): 9 (thousands)
- Urine output (urine): 2502

I will use these numbers to fill in the missing values, then remove the rows that still have missing values.

In [None]:
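# Work on a copy so the assignments below cannot trigger pandas'
# SettingWithCopyWarning if X is a slice of the full dataset
X = X.copy()
X['alb'] = X['alb'].fillna(3.5)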
X['pafi'] = X['pafi'].fillna(333.3)
X['bili'] = X['bili'].fillna(1.01)
X['crea'] = X['crea'].fillna(1.01)
X['bun'] = X['bun'].fillna(6.51)
# wblc is recorded in thousands, so the recommended fill-in value is 9, not 9000
X['wblc'] = X['wblc'].fillna(9)
X['urine'] = X['urine'].fillna(2502)
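
Before removing rows, it can help to see how many missing values remain in each column once the recommended defaults are applied. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for X, not rows from SUPPORT2); it uses a single `fillna` call with a dictionary of per-column defaults and compares the missing counts before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X; the real frame comes from fetch_ucirepo(id=880).
toy = pd.DataFrame({
    "alb": [3.1, np.nan, 2.8],
    "bili": [np.nan, 0.9, np.nan],
    "ph": [7.4, np.nan, 7.3],  # no recommended fill-in value for this column
})

# Missing values per column before imputation.
missing_before = toy.isna().sum()

# Fill only the columns that have a recommended default.
filled = toy.fillna({"alb": 3.5, "bili": 1.01})

# Missing values that remain after imputation.
missing_after = filled.isna().sum()

# Side-by-side report, one row per column.
report = pd.DataFrame({"before": missing_before, "after": missing_after})
print(report)
```

Columns that still show a non-zero count in the `after` column are the ones that would cause rows to be dropped later.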

In [18]:
X.head()

Unnamed: 0,age,sex,dzgroup,dzclass,num.co,edu,income,scoma,charges,totcst,...,bili,crea,sod,ph,glucose,bun,urine,adlp,adls,adlsc
0,62.84998,male,Lung Cancer,Cancer,0,11.0,$11-$25k,0.0,9715.0,,...,0.199982,1.199951,141.0,7.459961,,6.51,2502.0,7.0,7.0,7.0
1,60.33899,female,Cirrhosis,COPD/CHF/Cirrhosis,2,12.0,$11-$25k,44.0,34496.0,,...,1.01,5.5,132.0,7.25,,6.51,2502.0,,1.0,1.0
2,52.74698,female,Cirrhosis,COPD/CHF/Cirrhosis,2,12.0,under $11k,0.0,41094.0,,...,2.199707,2.0,134.0,7.459961,,6.51,2502.0,1.0,0.0,0.0
3,42.38498,female,Lung Cancer,Cancer,2,11.0,under $11k,0.0,3075.0,,...,1.01,0.799927,139.0,,,6.51,2502.0,0.0,0.0,0.0
4,79.88495,female,ARF/MOSF w/Sepsis,ARF/MOSF,1,,,26.0,50127.0,,...,1.01,0.799927,143.0,7.509766,,6.51,2502.0,,2.0,2.0


In [19]:
X.describe()

Unnamed: 0,age,num.co,edu,scoma,charges,totcst,totmcst,avtisst,sps,aps,...,bili,crea,sod,ph,glucose,bun,urine,adlp,adls,adlsc
count,9105.0,9105.0,7471.0,9104.0,8933.0,8217.0,5630.0,9023.0,9104.0,9104.0,...,9105.0,9105.0,9104.0,6821.0,4605.0,9105.0,9105.0,3464.0,6238.0,9105.0
mean,62.650823,1.868644,11.747691,12.058546,59995.79,30825.867768,28828.877838,22.610928,25.525872,37.597979,...,2.113261,1.765361,137.568541,7.415364,159.873398,19.998739,2357.326071,1.15791,1.637384,1.888272
std,15.59371,1.344409,3.447743,24.636694,102648.8,45780.820986,43604.261932,13.233248,9.899377,19.903852,...,4.548787,1.681084,6.029326,0.080563,88.391541,23.265787,1005.3585,1.739672,2.231358,2.003763
min,18.04199,0.0,0.0,0.0,1169.0,0.0,-102.71997,1.0,0.199982,0.0,...,0.099991,0.099991,110.0,6.829102,0.0,1.0,0.0,0.0,0.0,0.0
25%,52.797,1.0,10.0,0.0,9740.0,5929.5664,5177.4043,12.0,19.0,23.0,...,0.599976,0.899902,134.0,7.379883,103.0,6.51,2075.0,0.0,0.0,0.0
50%,64.85699,2.0,12.0,0.0,25024.0,14452.7344,13223.5,19.5,23.898438,34.0,...,1.01,1.199951,137.0,7.419922,135.0,6.51,2502.0,0.0,1.0,1.0
75%,73.99896,3.0,14.0,9.0,64598.0,36087.9375,34223.6016,31.666656,30.199219,49.0,...,1.299805,1.899902,141.0,7.469727,188.0,24.0,2502.0,2.0,3.0,3.0
max,101.84796,9.0,31.0,100.0,1435423.0,633212.0,710682.0,83.0,99.1875,143.0,...,63.0,21.5,181.0,7.769531,1092.0,300.0,9000.0,7.0,7.0,7.073242


Let's drop the two targets that I won't use, hospdead and sfdm2:

In [25]:
y = y.drop(["hospdead", "sfdm2"], axis=1)
y.head()

Unnamed: 0,death
0,0
1,1
2,1
3,1
4,0


To remove rows with missing values, I first need to combine X and y so that the same rows are dropped from both.
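
One possible way to do this, sketched on small made-up frames (`toy_X`, `toy_y` and their values are hypothetical, not taken from the dataset): concatenate features and target column-wise on their shared index, drop incomplete rows once, then split the result back apart. This assumes X and y share the same index, as they do when both come from the same `fetch_ucirepo` call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X and y, sharing the same row index.
toy_X = pd.DataFrame({"age": [62.8, 60.3, 52.7], "ph": [7.46, np.nan, 7.25]})
toy_y = pd.DataFrame({"death": [0, 1, 1]})

# Join features and target column-wise so each row carries its own label.
combined = pd.concat([toy_X, toy_y], axis=1)

# Drop every row that still has at least one missing value.
complete = combined.dropna()

# Split back into features and target; both keep the same surviving rows.
X_clean = complete.drop(columns=["death"])
y_clean = complete["death"]

print(X_clean.shape, y_clean.shape)
```

Dropping on the combined frame keeps the features and the target aligned, which would not be guaranteed if rows were removed from X alone.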