# Data cleaning
This notebook cleans and validates the three datasets used in this project
(primary, study, and validation cohorts) so that they are consistent, valid,
and suitable for downstream machine learning tasks.


In [14]:
import pandas as pd
from pathlib import Path

DATA_PATH = Path("../data")

primary = pd.read_csv(DATA_PATH/"primary_cohort.csv")
study = pd.read_csv(DATA_PATH/"study_cohort.csv")
validation = pd.read_csv(DATA_PATH/"validation_cohort.csv")

cohorts = {
    "primary": primary,
    "study": study,
    "validation": validation
}


## Verify data consistency (datasets)

Three cohorts are used in this project:
- Primary cohort
- Study cohort
- Validation cohort

All cohorts share the same schema and are cleaned using identical rules to ensure
consistency across the data pipeline.

### Age


In [15]:
primary["age_years"].describe()

count    110204.000000
mean         62.735255
std          24.126806
min           0.000000
25%          51.000000
50%          68.000000
75%          81.000000
max         100.000000
Name: age_years, dtype: float64

In [16]:
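# Count ages outside the plausible range [0, 120] in one expression,
# since a notebook cell only displays the value of its last line.
((primary["age_years"] < 0) | (primary["age_years"] > 120)).sum()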

np.int64(0)
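
The check above covers only the primary cohort. A minimal sketch of how the same range check could be run on every cohort is shown here; the helper name `count_out_of_range` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_out_of_range(df, column="age_years", low=0, high=120):
    """Count rows whose value falls outside [low, high]; NaN counts as invalid."""
    in_range = df[column].between(low, high)  # inclusive on both ends, NaN -> False
    return int((~in_range).sum())

# Toy stand-ins for the real cohorts (illustrative values, not project data).
toy_cohorts = {
    "primary": pd.DataFrame({"age_years": [0, 51, 68, 100]}),
    "study": pd.DataFrame({"age_years": [-1, 45, 130]}),
}

out_of_range = {name: count_out_of_range(df) for name, df in toy_cohorts.items()}
print(out_of_range)  # {'primary': 0, 'study': 2}
```

In the notebook itself, the same comprehension could be applied to the `cohorts` dictionary defined in the first cell.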

### Sex

In [17]:
primary["sex_0male_1female"].value_counts()


sex_0male_1female
0    57973
1    52231
Name: count, dtype: int64

### Hospital outcome

In [18]:
primary["hospital_outcome_1alive_0dead"].value_counts()

hospital_outcome_1alive_0dead
1    102099
0      8105
Name: count, dtype: int64

## Verify the structure

Before any cleaning operation is applied, basic validation checks are run to
identify potential data quality issues: invalid values, missing values, and
duplicated rows. The first check confirms that all three cohorts have the same
columns in the same order.

In [19]:
# Both comparisons must hold; a notebook cell only displays its last expression.
primary.columns.equals(study.columns) and primary.columns.equals(validation.columns)


True
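
Comparing column labels does not confirm that the columns also hold the same data types. A sketch of a stricter schema comparison follows; `same_schema` is a hypothetical helper and the frames are toy data, so treat it as one possible approach rather than the notebook's own check.

```python
import pandas as pd

def same_schema(reference, other):
    """True when both frames have the same column names, order, and dtypes."""
    same_columns = list(reference.columns) == list(other.columns)
    same_dtypes = list(reference.dtypes) == list(other.dtypes)
    return same_columns and same_dtypes

# Toy frames (illustrative only): b matches a, c stores age as text.
a = pd.DataFrame({"age_years": [68.0], "sex_0male_1female": [0]})
b = pd.DataFrame({"age_years": [51.0], "sex_0male_1female": [1]})
c = pd.DataFrame({"age_years": ["68"], "sex_0male_1female": [0]})

print(same_schema(a, b))  # True
print(same_schema(a, c))  # False: age_years is not float64 in c
```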

## Intra-cohort cleaning
### Minimal clean

In [20]:
for name, df in cohorts.items():
    print(f"\n--- {name.upper()} COHORT ---")
    print(df.shape)
    print(df.isna().sum())
    print("Duplicated rows:", df.duplicated().sum())


--- PRIMARY COHORT ---
(110204, 4)
age_years                        0
sex_0male_1female                0
episode_number                   0
hospital_outcome_1alive_0dead    0
dtype: int64
Duplicated rows: 108693

--- STUDY COHORT ---
(19051, 4)
age_years                        0
sex_0male_1female                0
episode_number                   0
hospital_outcome_1alive_0dead    0
dtype: int64
Duplicated rows: 17861

--- VALIDATION COHORT ---
(137, 4)
age_years                        0
sex_0male_1female                0
episode_number                   0
hospital_outcome_1alive_0dead    0
dtype: int64
Duplicated rows: 33
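
The duplicate counts above are large relative to the cohort sizes. With only four columns and no patient identifier, many rows can share the same value combination. The sketch below, on toy data, shows how duplicated rows relate to distinct combinations and how `DataFrame.value_counts()` can reveal which combinations repeat most often; it is a diagnostic aid under these assumptions, not a step from the original notebook.

```python
import pandas as pd

# Toy frame (illustrative only): three identical rows and one distinct row.
toy = pd.DataFrame({
    "age_years": [70, 70, 70, 45],
    "sex_0male_1female": [0, 0, 0, 1],
    "episode_number": [1, 1, 1, 1],
    "hospital_outcome_1alive_0dead": [1, 1, 1, 1],
})

n_rows = len(toy)
n_duplicated = int(toy.duplicated().sum())  # rows that repeat an earlier row
n_distinct = len(toy.drop_duplicates())     # unique value combinations
combo_counts = toy.value_counts()           # occurrences per combination, most frequent first

print(n_rows, n_duplicated, n_distinct)  # 4 2 2
print(combo_counts)
```

By construction, the duplicated rows plus the distinct combinations add up to the total row count.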


In [21]:
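def clean_cohort(df):
    """Return a cleaned copy of a cohort; the input frame is left unchanged."""
    df = df.copy()

    # Drop rows that are exact copies of an earlier row
    df = df.drop_duplicates()

    # Keep only plausible ages (0 to 120 years, inclusive)
    df = df[(df["age_years"] >= 0) & (df["age_years"] <= 120)]

    # Keep only rows with valid binary codes for sex and hospital outcome
    df = df[df["sex_0male_1female"].isin([0, 1])]
    df = df[df["hospital_outcome_1alive_0dead"].isin([0, 1])]

    return df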




## Cleaning rules

The following cleaning rules were applied uniformly across all cohorts:

- Removal of duplicated rows
- Removal of records with invalid ages (age < 0 or age > 120)
- Enforcement of binary encoding for sex (0 = male, 1 = female)
- Enforcement of binary encoding for hospital outcome (0 = deceased, 1 = alive)

These rules are based on basic data validity constraints and do not rely on the target
distribution or any modeling assumptions.


In [22]:
for name in cohorts:
    cohorts[name] = clean_cohort(cohorts[name])

## Cleaning impact summary

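Deduplication accounts for every removed row; the age and encoding filters removed
none. The primary cohort went from 110,204 to 1,511 rows, the study cohort from
19,051 to 1,190, and the validation cohort from 137 to 104. Because the cohorts
contain only four columns and no patient identifier, identical rows may describe
different patients rather than repeated entries.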
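
The impact of each rule can be tallied separately, which makes it clear which rule is responsible for the rows removed. The following is a minimal sketch that applies the same four rules in the same order as `clean_cohort`; the function name `cleaning_report` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def cleaning_report(df):
    """Apply each cleaning rule in order and record how many rows it removes."""
    steps = [
        ("drop duplicates", lambda d: d.drop_duplicates()),
        ("valid age", lambda d: d[d["age_years"].between(0, 120)]),
        ("binary sex", lambda d: d[d["sex_0male_1female"].isin([0, 1])]),
        ("binary outcome", lambda d: d[d["hospital_outcome_1alive_0dead"].isin([0, 1])]),
    ]
    rows = []
    current = df
    for label, step in steps:
        before = len(current)
        current = step(current)
        rows.append({
            "rule": label,
            "rows_before": before,
            "rows_removed": before - len(current),
        })
    return pd.DataFrame(rows)

# Toy frame (illustrative only): one duplicate, one invalid age, one invalid sex code.
toy = pd.DataFrame({
    "age_years": [70, 70, 130, 45, 30],
    "sex_0male_1female": [0, 0, 1, 2, 1],
    "episode_number": [1, 1, 1, 1, 1],
    "hospital_outcome_1alive_0dead": [1, 1, 1, 1, 0],
})

report = cleaning_report(toy)
print(report)
```

Calling `cleaning_report` on each original cohort before cleaning would give one such table per cohort.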


## Post-cleaning validation

After cleaning, all datasets were re-validated to confirm:
- absence of duplicated rows
- valid value ranges for all variables
- consistent data types
- absence of missing values

The cleaned datasets are considered ready for feature engineering and modeling.
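
The four criteria listed above can be checked programmatically. The sketch below is one way to do so; `validate_clean` is a hypothetical helper, and treating "consistent data types" as "every column is numeric" is an assumption based on the four columns shown earlier.

```python
import pandas as pd

def validate_clean(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicated rows")
    if df.isna().any().any():
        problems.append("missing values")
    if not df["age_years"].between(0, 120).all():
        problems.append("age out of range")
    for column in ("sex_0male_1female", "hospital_outcome_1alive_0dead"):
        if not df[column].isin([0, 1]).all():
            problems.append(f"non-binary values in {column}")
    # Assumption: every column is expected to be numeric.
    if not all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes):
        problems.append("non-numeric column")
    return problems

# Toy frames (illustrative only).
good = pd.DataFrame({
    "age_years": [68.0, 51.0],
    "sex_0male_1female": [0, 1],
    "episode_number": [1, 1],
    "hospital_outcome_1alive_0dead": [1, 0],
})
bad = pd.DataFrame({
    "age_years": [130.0, 130.0],
    "sex_0male_1female": [0, 0],
    "episode_number": [1, 1],
    "hospital_outcome_1alive_0dead": [1, 1],
})

print(validate_clean(good))  # []
print(validate_clean(bad))   # ['duplicated rows', 'age out of range']
```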


In [23]:
for name, df in cohorts.items():
    print(f"\n--- {name.upper()} AFTER CLEANING ---")
    print(df.shape)
    print("Duplicated rows:", df.duplicated().sum())


--- PRIMARY AFTER CLEANING ---
(1511, 4)
Duplicated rows: 0

--- STUDY AFTER CLEANING ---
(1190, 4)
Duplicated rows: 0

--- VALIDATION AFTER CLEANING ---
(104, 4)
Duplicated rows: 0


## Save the cleaned CSV files

In [24]:
cohorts["primary"].to_csv(DATA_PATH/"primary_cohort_clean.csv", index=False)
cohorts["study"].to_csv(DATA_PATH/"study_cohort_clean.csv", index=False)
cohorts["validation"].to_csv(DATA_PATH/"validation_cohort_clean.csv", index=False)

## Conclusion

This data cleaning step ensures that all cohorts are structurally consistent and free
from basic data quality issues. The cleaned datasets are saved as separate CSV files
and will be used as input for the next phases of the project.
