# 01 — Data Exploration
> MIMIC-IV Sepsis DRL: Exploring the Hourly-Binned Parquet Data

This notebook opens `data/processed/mimic_hourly_binned.parquet` with **Polars + PyArrow** and runs basic checks:
1. Does the file open correctly?
2. How many rows × how many columns?
3. Column names and data types (schema)
4. Null / missing rates
5. Number of unique values in each column
6. Basic statistics (describe)
7. Distribution of row counts per `stay_id`
8. Last rows (tail) as a forward-fill sanity check

In [1]:
import polars as pl
import pyarrow.parquet as pq
from pathlib import Path

PARQUET_PATH = Path("..") / "data" / "processed" / "mimic_hourly_binned.parquet"
print(f"File exists? → {PARQUET_PATH.exists()}")
print(f"File size    → {PARQUET_PATH.stat().st_size / 1e6:.1f} MB")

File exists? → True
File size    → 182.1 MB


## 1 · Schema with PyArrow (Quick Metadata Check)
Read only the metadata, without loading the file into memory.

In [2]:
pf = pq.ParquetFile(PARQUET_PATH)

print(f"Rows       : {pf.metadata.num_rows:,}")
print(f"Row groups : {pf.metadata.num_row_groups}")
print(f"Columns    : {pf.metadata.num_columns}")
print()
print("=== PyArrow Schema ===")
print(pf.schema_arrow)

Rows       : 8,808,129
Row groups : 72
Columns    : 43

=== PyArrow Schema ===
stay_id: int64
hour_bin: timestamp[us]
heart_rate: double
sbp: double
dbp: double
mbp: double
resp_rate: double
spo2: double
temp_c: double
fio2: double
lactate: double
creatinine: double
bilirubin_total: double
platelet: double
wbc: double
bun: double
glucose: double
sodium: double
potassium: double
hemoglobin: double
hematocrit: double
bicarbonate: double
chloride: double
anion_gap: double
inr: double
pao2: double
paco2: double
ph: double
urine_output: double
norepinephrine_dose: double
epinephrine_dose: double
phenylephrine_dose: double
vasopressin_dose: double
dopamine_dose: double
dobutamine_dose: double
crystalloid_ml: double
gcs_eye: double
gcs_motor: double
gcs_verbal: double
gcs_total: double
gender: large_string
age: int64
admission_type: large_string


## 2 · Loading with Polars & First Look

In [3]:
df = pl.read_parquet(PARQUET_PATH)
print(f"Shape: {df.shape}  →  {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head(10)

Shape: (8808129, 43)  →  8,808,129 rows × 43 columns


stay_id,hour_bin,heart_rate,sbp,dbp,mbp,resp_rate,spo2,temp_c,fio2,lactate,creatinine,bilirubin_total,platelet,wbc,bun,glucose,sodium,potassium,hemoglobin,hematocrit,bicarbonate,chloride,anion_gap,inr,pao2,paco2,ph,urine_output,norepinephrine_dose,epinephrine_dose,phenylephrine_dose,vasopressin_dose,dopamine_dose,dobutamine_dose,crystalloid_ml,gcs_eye,gcs_motor,gcs_verbal,gcs_total,gender,age,admission_type
i64,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,i64,str
30000153,2174-09-29 12:00:00,100.0,136.0,74.0,89.0,18.0,100.0,,75.0,,,,,,,,,,,35.0,,,,,,,,280.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 13:00:00,104.0,132.0,74.5,84.0,16.0,100.0,,75.0,1.3,,,,,,,,,,35.0,,,,,221.0,45.0,7.3,280.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 14:00:00,83.0,131.0,61.0,80.0,16.0,100.0,,75.0,2.1,,,,,,,,,,35.0,,,,,263.0,45.0,7.3,45.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 15:00:00,92.0,123.0,65.0,84.0,14.0,100.0,,50.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,31.7,19.0,115.0,12.0,1.1,263.0,45.0,7.3,50.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 16:00:00,83.0,109.0,55.0,71.0,16.0,100.0,,50.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,31.7,19.0,115.0,12.0,1.1,215.0,42.0,7.31,50.0,0.0,0.0,0.0,0.0,0.0,0.0,941.299999,4.0,6.0,1.0,11.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 17:00:00,103.0,111.0,56.0,71.0,20.0,100.0,,50.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,31.7,19.0,115.0,12.0,1.1,215.0,42.0,7.31,45.0,0.0,0.0,0.0,0.0,0.0,0.0,941.299999,4.0,6.0,1.0,11.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 18:00:00,111.0,133.0,63.0,83.0,19.0,99.0,,50.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,31.7,19.0,115.0,12.0,1.1,215.0,42.0,7.31,70.0,0.0,0.0,0.0,0.0,0.0,0.0,941.299999,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 19:00:00,123.0,155.0,68.0,91.0,21.0,96.0,,50.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,32.1,19.0,115.0,12.0,1.1,215.0,42.0,7.31,70.0,0.0,0.0,0.0,0.0,0.0,0.0,941.299999,3.0,5.0,1.0,9.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 20:00:00,128.0,122.0,67.0,83.0,21.0,98.0,,40.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,32.1,19.0,115.0,12.0,1.1,215.0,42.0,7.31,70.0,0.0,0.0,0.0,0.0,0.0,0.0,941.299999,3.0,6.0,3.0,12.0,"""M""",61,"""EW EMER."""
30000153,2174-09-29 21:00:00,123.0,136.0,67.0,87.0,22.0,96.0,,40.0,2.1,0.9,,173.0,17.0,22.0,192.0,142.0,4.4,10.8,32.1,19.0,115.0,12.0,1.1,215.0,42.0,7.31,80.0,0.0,0.0,0.0,0.0,0.0,0.0,199.999995,3.0,6.0,3.0,12.0,"""M""",61,"""EW EMER."""


## 3 · Column Names and Data Types

In [4]:
schema_df = pl.DataFrame({
    "column": df.columns,
    "dtype": [str(dt) for dt in df.dtypes],
})
schema_df

column,dtype
str,str
"""stay_id""","""Int64"""
"""hour_bin""","""Datetime(time_unit='us', time_…"
"""heart_rate""","""Float64"""
"""sbp""","""Float64"""
"""dbp""","""Float64"""
…,…
"""gcs_verbal""","""Float64"""
"""gcs_total""","""Float64"""
"""gender""","""String"""
"""age""","""Int64"""


## 4 · Null / Missing Rates
Null count and percentage for each column.

In [5]:
null_counts = df.null_count()
total_rows = df.shape[0]

null_df = pl.DataFrame({
    "column": df.columns,
    "null_count": [null_counts[col][0] for col in df.columns],
    "null_pct": [round(null_counts[col][0] / total_rows * 100, 2) for col in df.columns],
}).sort("null_pct", descending=True)

print(f"Total rows: {total_rows:,}")
null_df

Total rows: 8,808,129


column,null_count,null_pct
str,i64,f64
"""temp_c""",7343072,83.37
"""bilirubin_total""",3284098,37.28
"""fio2""",2850062,32.36
"""lactate""",2467801,28.02
"""paco2""",2281818,25.91
…,…,…
"""stay_id""",0,0.0
"""hour_bin""",0,0.0
"""gender""",0,0.0
"""age""",0,0.0


## 5 · Unique Value Counts
Number of unique (distinct) values in each column.

In [6]:
unique_df = pl.DataFrame({
    "column": df.columns,
    "n_unique": [df[col].n_unique() for col in df.columns],
}).sort("n_unique", descending=True)

unique_df

column,n_unique
str,i64
"""hour_bin""",846870
"""crystalloid_ml""",739641
"""norepinephrine_dose""",320794
"""phenylephrine_dose""",141675
"""stay_id""",94458
…,…
"""gcs_motor""",28
"""gcs_verbal""",27
"""gcs_eye""",22
"""admission_type""",9


## 6 · Basic Statistics (describe)

In [7]:
df.describe()

statistic,stay_id,hour_bin,heart_rate,sbp,dbp,mbp,resp_rate,spo2,temp_c,fio2,lactate,creatinine,bilirubin_total,platelet,wbc,bun,glucose,sodium,potassium,hemoglobin,hematocrit,bicarbonate,chloride,anion_gap,inr,pao2,paco2,ph,urine_output,norepinephrine_dose,epinephrine_dose,phenylephrine_dose,vasopressin_dose,dopamine_dose,dobutamine_dose,crystalloid_ml,gcs_eye,gcs_motor,gcs_verbal,gcs_total,gender,age,admission_type
str,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,str
"""count""",8808129.0,"""8808129""",8770031.0,8740315.0,8740024.0,8741935.0,8760778.0,8764642.0,1465057.0,5958067.0,6340328.0,8327579.0,5524031.0,8299103.0,8295935.0,8326449.0,8211747.0,8271772.0,8293823.0,8297261.0,8316840.0,8325363.0,8331191.0,8310005.0,7748676.0,6527544.0,6526311.0,6644951.0,8293822.0,7194083.0,7194083.0,7194083.0,7194083.0,7194083.0,7194083.0,7194083.0,8700945.0,8696527.0,8698579.0,8702212.0,"""8808129""",8808129.0,"""8808129"""
"""null_count""",0.0,"""0""",38098.0,67814.0,68105.0,66194.0,47351.0,43487.0,7343072.0,2850062.0,2467801.0,480550.0,3284098.0,509026.0,512194.0,481680.0,596382.0,536357.0,514306.0,510868.0,491289.0,482766.0,476938.0,498124.0,1059453.0,2280585.0,2281818.0,2163178.0,514307.0,1614046.0,1614046.0,1614046.0,1614046.0,1614046.0,1614046.0,1614046.0,107184.0,111602.0,109550.0,105917.0,"""0""",0.0,"""0"""
"""mean""",34974000.0,"""2153-10-15 03:38:30.535507""",87.819938,120.478139,65.184255,84.512153,21.169934,137.767256,38.354066,46.397577,3.263103,1.481856,2.209595,219.682421,12.125175,30.943375,137.819228,139.020176,4.103613,9.777631,30.068222,24.765002,103.07367,13.531571,1.434417,100.873874,42.377398,7.396889,184.733394,0.176258,0.063585,0.737237,0.188185,0.782889,1.025641,251.693691,3.385539,5.282393,3.317324,11.923467,,62.641503,
"""std""",2884300.0,,3797.403494,491.228532,259.152687,4828.722159,2407.519375,19406.703678,9.811226,49.24549,1433.422544,1.46838,5.210624,132.416995,8.458586,25.204517,57.215769,5.358748,0.565285,1.967902,5.714107,5.060515,6.661833,3.992657,0.656184,58.592579,10.528077,0.071144,364.241888,1.054701,8.839282,6.477451,2.627308,14.516527,17.072216,1572.401753,1.003951,1.494538,1.858582,3.831584,,16.111648,
"""min""",30000153.0,"""2110-01-11 10:00:00""",-241395.0,-94.0,-40.0,-9806.0,0.0,-951234.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,67.0,0.8,0.0,0.0,2.0,39.0,-24.0,0.5,-32.0,0.0,0.94,-3765.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,"""F""",18.0,"""AMBULATORY OBSERVATION"""
"""25%""",32477246.0,"""2133-12-07 02:00:00""",73.0,104.0,53.0,69.0,16.0,95.0,36.6,40.0,1.0,0.7,0.4,131.0,7.7,14.0,104.0,136.0,3.7,8.3,25.8,22.0,99.0,11.0,1.1,61.0,36.0,7.36,50.0,0.0,0.0,0.0,0.0,0.0,0.0,47.833335,3.0,5.0,1.0,10.0,,53.0,
"""50%""",34965363.0,"""2153-08-17 00:00:00""",85.0,118.0,62.0,78.0,19.25,97.0,37.1,40.0,1.4,1.0,0.6,196.0,10.6,23.0,125.0,139.0,4.0,9.4,29.1,24.0,103.0,13.0,1.2,92.0,41.0,7.4,120.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,4.0,6.0,4.0,14.0,,64.0,
"""75%""",37460082.0,"""2173-11-27 22:00:00""",98.0,134.0,73.0,89.0,24.0,99.0,37.6,50.0,1.9,1.7,1.4,280.0,14.5,39.0,155.0,142.0,4.4,11.0,33.5,28.0,107.0,16.0,1.5,125.0,47.0,7.45,250.0,0.0,0.0,0.0,0.0,0.0,0.0,295.833342,4.0,6.0,5.0,15.0,,75.0,
"""max""",39999858.0,"""2214-08-11 05:00:00""",10000000.0,1003110.0,114109.0,8999090.0,7000400.0,9900000.0,987.4,40100.0,1276103.0,80.0,87.2,2385.0,572.5,305.0,5840.0,185.0,26.5,24.6,71.2,50.0,155.0,89.0,27.4,4242.0,243.0,7.96,876587.0,1099.999975,4740.164044,1000.00005,399.999997,1008.783077,1023.107846,1000400.0,4.0,6.0,5.0,15.0,"""M""",91.0,"""URGENT"""


## 7 · Row Distribution per `stay_id`
How many hourly rows does each stay have?

In [8]:
stay_counts = (
    df.group_by("stay_id")
    .agg(pl.len().alias("n_hours"))
    .sort("n_hours", descending=True)
)

print(f"Total unique stay_id: {stay_counts.shape[0]:,}")
print()
print(stay_counts["n_hours"].describe())
print()
print("10 longest stays:")
stay_counts.head(10)

Total unique stay_id: 94,458

shape: (9, 2)
┌────────────┬────────────┐
│ statistic  ┆ value      │
│ ---        ┆ ---        │
│ str        ┆ f64        │
╞════════════╪════════════╡
│ count      ┆ 94458.0    │
│ null_count ┆ 0.0        │
│ mean       ┆ 93.249158  │
│ std        ┆ 130.288585 │
│ min        ┆ 1.0        │
│ 25%        ┆ 31.0       │
│ 50%        ┆ 53.0       │
│ 75%        ┆ 101.0      │
│ max        ┆ 5411.0     │
└────────────┴────────────┘

10 longest stays:


stay_id,n_hours
i64,u32
36032605,5411
36307509,4006
39510663,3421
30359303,3269
35629939,3051
31492392,3040
39245279,2683
32380519,2457
38018615,2423
31879957,2386


## 8 · Last 10 Rows (tail)
Let's also look at the end of the data and check whether forward-fill worked properly.

In [9]:
df.tail(10)

stay_id,hour_bin,heart_rate,sbp,dbp,mbp,resp_rate,spo2,temp_c,fio2,lactate,creatinine,bilirubin_total,platelet,wbc,bun,glucose,sodium,potassium,hemoglobin,hematocrit,bicarbonate,chloride,anion_gap,inr,pao2,paco2,ph,urine_output,norepinephrine_dose,epinephrine_dose,phenylephrine_dose,vasopressin_dose,dopamine_dose,dobutamine_dose,crystalloid_ml,gcs_eye,gcs_motor,gcs_verbal,gcs_total,gender,age,admission_type
i64,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,i64,str
39999858,2167-05-01 06:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.7,0.7,226.0,9.6,16.0,117.0,137.0,4.1,12.7,38.9,28.0,100.0,9.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-02 06:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.8,0.6,263.0,9.2,15.0,135.0,137.0,4.3,12.2,38.0,28.0,101.0,8.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-03 06:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.7,0.6,263.0,9.2,18.0,132.0,139.0,4.5,12.2,38.0,30.0,101.0,8.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-04 07:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.8,0.6,263.0,9.2,16.0,121.0,137.0,4.3,12.2,38.0,32.0,97.0,8.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-05 06:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.8,0.6,305.0,12.2,17.0,103.0,135.0,4.4,13.4,41.4,32.0,95.0,8.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-06 06:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.9,0.6,269.0,12.0,19.0,125.0,136.0,4.4,13.7,41.7,30.0,97.0,9.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-07 08:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.9,0.6,244.0,11.4,18.0,142.0,136.0,4.3,14.2,43.3,28.0,99.0,9.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-08 07:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.9,0.6,214.0,13.2,15.0,154.0,137.0,5.0,13.6,42.6,29.0,99.0,9.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-09 07:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.8,1.4,208.0,12.3,11.0,164.0,135.0,4.8,13.3,39.9,29.0,98.0,8.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
39999858,2167-05-10 08:00:00,82.0,107.0,57.0,69.0,28.0,90.0,,40.0,,0.7,1.2,185.0,11.5,10.0,174.0,132.0,3.9,13.7,42.1,26.0,95.0,11.0,1.3,,,,350.0,0.0,0.0,0.0,0.0,0.0,0.0,249.99999,4.0,6.0,5.0,15.0,"""M""",62,"""EW EMER."""
