# Phase 1 – Exploratory Data Analysis



## 1.1 Basic Data Description and Characteristics


# A)

| **Dataset** | **Number of Records** | **Number of Attributes** | **Main columns** | **Description** |
|--------------|------------------------|----------------------------|-----------------|-----------------|
| **patient.csv** | 2,068 | 13 |username, registration, address, ssn, residence, birthdate, sex, age, bmi, height, weight, diagnosis, treatment| Contains patient information – demographic data (age, sex, address) and basic clinical characteristics (BMI, diagnosis, treatment). |
| **station.csv** | 832 | 6 |station, latitude, revision, longitude, code, location|  Data about individual measurement stations – GPS coordinates, station code, name, and hardware revision. |
| **observation.csv** | 12,046 | 23 | SpO₂, HR, PI, RR, EtCO₂, FiO₂, PRV, BP, Skin Temp, Core Temp, O₂Flow, pH, PaO₂, PaCO₂, Na⁺, K⁺, Ca²⁺, Hb, Hct, Glu, Lac, patient, station| Records of patient vital parameter measurements, including oxygen saturation (SpO₂), heart rate, respiratory rate, and other biochemical indicators. |

The patient.csv file contains 2,068 patients, each with a valid station_ID that references one of the 832 stations listed in station.csv (IDs range from 0 to 831).
The observation.csv table contains 12,046 measurements, and all of them can be matched to a station through identical latitude and longitude coordinates.
Every patient and every observation can therefore be linked to its station.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "DejaVu Sans" 

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

In [None]:
def read_any(path):
    try:
        return pd.read_csv(path, sep=None, engine="python", encoding="utf-8-sig")
    except Exception:
        return pd.read_csv(path, sep=";", engine="python", encoding="utf-8-sig")

patient = read_any("data/patient.csv")
station = read_any("data/station.csv")
observation = read_any("data/observation.csv")
sensor_range = read_any("data/sensor_variable_range.csv")

(len(patient), len(station), len(observation), len(sensor_range))


In [None]:
def quick_profile(df, name):
    print(f"\n================================= {name} =================================")
    print(f"Rows: {len(df):,} | Columns: {len(df.columns)}")
    print("Columns:", list(df.columns))
    print("\nData types:")
    display(df.dtypes.to_frame("dtype"))
    print("\nMissing values per column (top 10):")
    display(df.isna().sum().sort_values(ascending=False).head(10))
    print("\nSample rows:")
    display(df.head(5))

quick_profile(patient, "patient")
quick_profile(station, "station")
quick_profile(observation, "observation")
quick_profile(sensor_range, "sensor_variable_range")


In [None]:
if "station_ID" in patient.columns:
    sid = pd.to_numeric(patient["station_ID"], errors="coerce").dropna().astype(int)
    print("patient.station_ID range:", int(sid.min()), "to", int(sid.max()), "| station rows:", len(station))

if {"latitude","longitude"}.issubset(observation.columns) and {"latitude","longitude"}.issubset(station.columns):
    st_pairs = set(zip(pd.to_numeric(station["latitude"], errors="coerce").round(6), pd.to_numeric(station["longitude"], errors="coerce").round(6)))
    obs_pairs = list(zip(pd.to_numeric(observation["latitude"], errors="coerce").round(6), pd.to_numeric(observation["longitude"], errors="coerce").round(6)))
    mapped = sum(1 for p in obs_pairs if p in st_pairs)
    print(f"Observations with a station coordinate match: {mapped:,} / {len(obs_pairs):,}")


# B)

We analyzed 12 numeric attributes from observation.csv (SpO₂, HR, PI, RR, EtCO₂, FiO₂, PRV, Skin Temperature, PVI, Hb level, SV, CO). For each attribute, we computed descriptive statistics and compared values against the prescribed ranges from sensor_variable_range.csv. Across 12,046 observations per attribute, there were no missing values and 0% of values fell outside the expected ranges. Min–max values for each attribute matched the stated bounds, indicating full compliance.
Distributions were visualized with histograms and boxplots. Because the minimum and maximum of every variable equal the exact prescribed bounds, the dataset may have been normalized or clipped to those limits before it was published.
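A rough way to probe the normalization hypothesis is to look at how many values sit exactly on a bound: a visible pile-up at the limits points to clipping, while a single value at each end is more consistent with rescaling. The sketch below uses hypothetical toy data and a hypothetical helper name (`boundary_share`); the bounds 90 and 100 are placeholders, not values taken from sensor_variable_range.csv.

```python
import numpy as np
import pandas as pd

def boundary_share(values, low, high):
    """Return the share of values that sit exactly on the lower or upper bound."""
    s = pd.to_numeric(values, errors="coerce").dropna()
    return {
        "at_low": float(np.isclose(s, low).mean()),
        "at_high": float(np.isclose(s, high).mean()),
        "n": int(s.size),
    }

# Hypothetical toy series, clipped to the placeholder range [90, 100]
demo = pd.Series([85, 92, 95, 101, 99, 104]).clip(lower=90, upper=100)
share = boundary_share(demo, low=90, high=100)
print(share)
```

Applied to a real column, the same call would take the bounds parsed from the range table for that variable.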

In [None]:
candidates = ["SpO₂", "HR", "PI", "RR", "EtCO₂", "FiO₂", "PRV", "Skin Temperature", "PVI", "Hb level", "SV", "CO"]
vars_to_check = [c for c in candidates if c in observation.columns]
observation_subset = observation[vars_to_check]
observation_subset

In [None]:
obs_num = observation[vars_to_check].apply(pd.to_numeric, errors="coerce")

desc = obs_num.describe().T 
desc["missing"] = obs_num.isna().sum()
desc["non_null"] = obs_num.notna().sum()
desc


In [None]:
for col in vars_to_check:
    s = pd.to_numeric(observation[col], errors="coerce").dropna()
    if s.empty:
        print(f"[{col}] No numeric data.")
        continue

    plt.figure()
    plt.hist(s, bins=30)
    plt.title(f"Histogram — {col}")
    plt.xlabel("Value"); plt.ylabel("Frequency")
    plt.show()

    plt.figure()
    plt.boxplot(s, vert=True)
    plt.title(f"Boxplot — {col}")
    plt.ylabel("Value")
    plt.show()


In [None]:
import re

def check_ranges(row):
    var = row['Variable']
    rng = row['Value Range']

    if not isinstance(var, str) or var not in obs_num.columns or not isinstance(rng, str):
        return None

    m = re.search(r'(-?\d+(?:\.\d+)?)\s*[-–—]\s*(-?\d+(?:\.\d+)?)', rng)
    if not m:
        return None

    lo, hi = float(m.group(1)), float(m.group(2))

    s = obs_num[var].dropna()
    if s.empty:
        return None

    below = int((s < lo).sum())
    above = int((s > hi).sum())
    total = int(s.size)
    return {
        'variable': var, 'expected_low': lo, 'expected_high': hi,
        'values_checked': total, 'below_range': below, 'above_range': above,
        'outside_%': round(100 * (below + above) / total, 2)
    }

results = sensor_range.apply(check_ranges, axis=1).dropna()
pd.DataFrame(results.tolist()).sort_values('outside_%', ascending=False)


# C)

The heatmap shows predominantly weak correlations, with one notable negative correlation between Skin Temperature and BP (Blood Pressure).
When skin temperature increases, blood pressure tends to decrease.
From a modeling perspective, these two variables are not fully independent and may share a common underlying factor, such as the body's thermoregulatory response.
The heatmap also reveals a strong positive correlation of about 0.7 between Oximetry and EtCO₂. Both variables describe related aspects of respiratory function, which explains their close relationship.
They may carry overlapping information and should be treated as correlated predictors.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

num_df = observation.select_dtypes(include="number")

corr = num_df.corr()

corr_pairs = (
    corr.unstack()
    .reset_index()
    .rename(columns={"level_0": "Variable 1", "level_1": "Variable 2", 0: "r"})
)
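# Keep each unordered pair once; .copy() avoids a SettingWithCopyWarning on the next assignment
corr_pairs = corr_pairs[corr_pairs["Variable 1"] < corr_pairs["Variable 2"]].copy()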
corr_pairs["abs_r"] = corr_pairs["r"].abs()
corr_pairs.sort_values("abs_r", ascending=False).head(10)


In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation heatmap of numeric attributes")
plt.show()


# D)
The correlation analysis between the target variable SpO₂ and all numeric predictors shows that all correlation coefficients are very close to zero (|r| < 0.02).
This indicates that no single variable has a strong linear relationship with SpO₂.
The highest yet still weak positive correlations appear for EtCO₂ and Oximetry, while slightly negative relationships are observed for Respiratory effort, BP, and Hb level.
From a modeling perspective, SpO₂ is not driven by a single strong linear predictor. Near-zero Pearson coefficients do not rule out dependence, so any relationship that exists is more likely non-linear or spread across several variables.
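To illustrate why a near-zero Pearson coefficient is not proof of independence, the sketch below compares Pearson's r with a binned correlation ratio (eta squared) on synthetic data where `y` depends on `x` only through `x**2`. The helper `correlation_ratio` and the toy data are assumptions for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlation_ratio(x, y, bins=10):
    """Eta squared: share of the variance of y explained by the means of y within quantile bins of x."""
    df = pd.DataFrame({"x": x, "y": y}).dropna()
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grand_mean = df["y"].mean()
    grouped = df.groupby("bin", observed=True)["y"]
    between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    total = ((df["y"] - grand_mean) ** 2).sum()
    return float(between / total) if total > 0 else float("nan")

# Synthetic example: a purely quadratic (symmetric) dependence
x = np.linspace(-1, 1, 201)
y = x ** 2

pearson_r = float(np.corrcoef(x, y)[0, 1])   # close to zero despite full dependence
eta_sq = correlation_ratio(x, y)             # close to one
print(f"Pearson r = {pearson_r:.3f}, eta squared = {eta_sq:.3f}")
```

Running the same helper on each predictor against SpO₂ would show whether any of them carries non-linear signal that the Pearson table misses.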

In [None]:
target = "SpO₂"
predictors = [v for v in num_df.columns if v != target]

target_corr = num_df[predictors + [target]].corr()[target].drop(target).sort_values(ascending=False)
target_corr


In [None]:
plt.figure(figsize=(10,4))
target_corr.plot(kind="bar", color="steelblue")
plt.title("Correlation of predictors with SpO₂ (target variable)")
plt.ylabel("Correlation coefficient (r)")
plt.show()


# E)

**Are some attributes dependent on each other?**

Most pairs of numerical attributes show weak linear correlations (the heatmap is mostly light blue).

Notable exceptions:
- Skin Temperature <-> BP: a negative correlation, meaning higher skin temperature is associated with lower blood pressure.
- Oximetry <-> EtCO₂: a strong positive correlation of about 0.7, as both describe aspects of respiratory function and may carry overlapping information.

There is no critical multicollinearity among all sensors, but certain pairs are related and should be handled carefully in modeling.
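Pairwise correlations alone can miss multicollinearity that involves three or more variables. One common complement is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below is a minimal version on synthetic data; the helper name `vif_from_corr` and the columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    """Variance inflation factors, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    inv = np.linalg.inv(corr)
    return pd.Series(np.diag(inv), index=df.columns, name="VIF")

# Synthetic predictors: b is correlated with a at roughly r = 0.7, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.7 * a + np.sqrt(1 - 0.7 ** 2) * rng.normal(size=500)
c = rng.normal(size=500)

vif = vif_from_corr(pd.DataFrame({"a": a, "b": b, "c": c}))
print(vif.round(2))
```

A correlation of about 0.7 between two predictors corresponds to a VIF near 2, well below the thresholds of 5 or 10 that are often used as rules of thumb.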

**Which attributes does the predicted variable depend on? (SpO₂)**

The correlations between SpO₂ and all other numeric predictors are very small. No single strong linear predictor exists for SpO₂; if its value depends on the other attributes, it does so through a combination of factors or through non-linear relationships.

**Is it necessary to combine records from multiple files?**

Yes, merging the datasets increases the information content and context for analysis:
- observation – contains the primary physiological measurements.
- station – adds contextual information.
- patient – provides patient characteristics.

In practice, the datasets should be joined via shared keys such as station_ID and patient_id. Derived features such as is_out_of_range flags, deviations from the expected range, and time-based aggregations can then be added.
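The join and one derived feature could look roughly like the sketch below. It uses small hypothetical frames (`station_demo`, `observation_demo`), joins on rounded coordinates as in the matching check from section 1.1, and assumes a placeholder HR range of 60 to 100; none of these values come from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of station.csv and observation.csv
station_demo = pd.DataFrame({
    "station": ["A", "B"],
    "latitude": [48.15, 49.22],
    "longitude": [17.11, 18.74],
})
observation_demo = pd.DataFrame({
    "latitude": [48.15, 49.22, 48.15],
    "longitude": [17.11, 18.74, 17.11],
    "HR": [72.0, 130.0, 88.0],
})

# Round the coordinates so that tiny float differences do not break the join
key = ["latitude", "longitude"]
for frame in (station_demo, observation_demo):
    frame[key] = frame[key].round(6)

# validate= raises if a coordinate pair maps to more than one station;
# indicator= adds a _merge column that exposes unmatched observations
merged = observation_demo.merge(
    station_demo, on=key, how="left", validate="many_to_one", indicator=True
)

# Derived feature: flag values outside an assumed (placeholder) range
hr_low, hr_high = 60.0, 100.0
merged["is_out_of_range"] = ~merged["HR"].between(hr_low, hr_high)
print(merged)
```

A left join keeps every observation even when no station matches, so the `_merge` column doubles as a linkage check.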





## 1.2 Problem identification, data integration and cleaning

# A)

**Data structure and relationships**

All files were successfully imported in CSV format with UTF-8-SIG encoding, preventing problems with special characters or accents. Column names contained occasional extra spaces or inconsistent dash symbols ("–", "—", "-"), which were standardized.
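The column-name standardization could be done along these lines. This is a minimal sketch: the helper `clean_column_name` and the example headers are hypothetical, and the exact rules applied in the project may differ.

```python
import re
import pandas as pd

def clean_column_name(name):
    """Replace en/em dashes with a hyphen, collapse repeated whitespace, and trim the ends."""
    name = re.sub(r"[\u2013\u2014]", "-", str(name))
    name = re.sub(r"\s+", " ", name)
    return name.strip()

# Hypothetical headers showing the kinds of inconsistencies described above
demo = pd.DataFrame(columns=[" Skin Temperature ", "Min\u2013Max", "Hb  level"])
demo.columns = [clean_column_name(c) for c in demo.columns]
print(list(demo.columns))
```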

The relationships between tables were verified:
- patient.station_ID correctly links to stations in station.csv.
- observation records can be matched to stations using geographical coordinates (latitude, longitude).

The overall structure is coherent and ready for integration.

**Missing or incomplete values**

Several numeric columns contain missing (NaN) or zero-like placeholder values.
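A per-column count of NaN and zero values makes this statement checkable. The sketch below uses a hypothetical helper (`missing_and_zero_summary`) on toy data; whether a zero is a placeholder or a legitimate measurement still has to be decided per variable.

```python
import numpy as np
import pandas as pd

def missing_and_zero_summary(df):
    """Count NaN and exact-zero values for every numeric column."""
    num = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "missing": num.isna().sum(),
        "zeros": (num == 0).sum(),
    })
    summary["missing_%"] = (100 * summary["missing"] / len(df)).round(2)
    return summary.sort_values(["missing", "zeros"], ascending=False)

# Hypothetical toy frame with one NaN and several zero placeholders
demo = pd.DataFrame({
    "bmi": [22.5, np.nan, 0.0, 27.1],
    "weight": [70.0, 82.0, 0.0, 0.0],
})
summary = missing_and_zero_summary(demo)
print(summary)
```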

**Duplicate records**

Each file was tested for duplicate rows using df.duplicated().sum(). No duplicate entries were found in any of the files.

**Inconsistent data formats**

Some numeric variables are stored as strings or contain non-numeric symbols. They were converted to proper numeric types using pd.to_numeric(errors="coerce"), which turns unparseable entries into NaN. Text columns containing names, addresses, or codes were kept as strings.
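Because `errors="coerce"` silently turns unparseable entries into NaN, it is worth counting how many values are lost in the conversion. A minimal sketch on hypothetical raw strings:

```python
import pandas as pd

# Hypothetical raw column mixing clean numbers, a decimal comma, text, and a unit suffix
raw = pd.Series(["36.6", " 37,1", "n/a", "38.2°C", None])
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN during conversion
lost = raw.notna() & converted.isna()
print(f"Values lost to coercion: {int(lost.sum())} of {int(raw.notna().sum())}")
print(raw[lost].tolist())
```

Inspecting `raw[lost]` shows which patterns (decimal commas, unit suffixes) could be repaired before conversion instead of being discarded.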

**Outliers**

All numerical variables were compared with their reference limits from sensor_variable_range.csv. Every measurement fell within the prescribed physiological ranges (0% outside).

In [None]:
dup_check = pd.DataFrame({
    "Dataset": ["patient", "station", "observation"],
    "Duplicate Rows": [
        patient.duplicated().sum(),
        station.duplicated().sum(),
        observation.duplicated().sum()
    ]
})
dup_check