# Exploring the dataset

This notebook focuses on a high-level structural exploration of the primary cohort dataset.
The objective is to verify data integrity, understand basic characteristics, and ensure
the dataset is suitable for further analysis.

Detailed exploratory data analysis (EDA) and feature relationships are intentionally
deferred to later phases of the project.


In [1]:
import pandas as pd
from pathlib import Path

DATA_PATH = Path("../data")

primary = pd.read_csv(DATA_PATH/"primary_cohort.csv")

primary.head()





| | age_years | sex_0male_1female | episode_number | hospital_outcome_1alive_0dead |
|---|---|---|---|---|
| 0 | 21 | 1 | 1 | 1 |
| 1 | 20 | 1 | 1 | 1 |
| 2 | 21 | 1 | 1 | 1 |
| 3 | 77 | 0 | 1 | 1 |
| 4 | 72 | 0 | 1 | 1 |


### Dataset size and dimensionality

In [2]:
print(len(primary))
print(primary.shape)

110204
(110204, 4)


### Data types verification

In [3]:
# The same dtype check applies to the other datasets
print(primary.dtypes)

age_years                        int64
sex_0male_1female                int64
episode_number                   int64
hospital_outcome_1alive_0dead    int64
dtype: object
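
Since every column is stored as `int64`, the dtype check alone cannot confirm that the binary columns only hold 0 and 1, or that ages are plausible. Below is a minimal sketch of such a value-domain check, run on a small made-up frame. The helper name `check_value_domains` and the 0 to 120 age bounds are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical helper: returns a dict of column -> number of out-of-domain values.
def check_value_domains(df, binary_cols, age_col="age_years", age_bounds=(0, 120)):
    problems = {}
    for col in binary_cols:
        # Count values that are neither 0 nor 1
        problems[col] = int((~df[col].isin([0, 1])).sum())
    low, high = age_bounds
    # Count ages outside the assumed plausible range (bounds are inclusive)
    problems[age_col] = int((~df[age_col].between(low, high)).sum())
    return problems

# Toy data with the same column names as the primary cohort (values are made up)
toy = pd.DataFrame({
    "age_years": [21, 77, 130],
    "sex_0male_1female": [1, 0, 2],
    "hospital_outcome_1alive_0dead": [1, 1, 0],
})

report = check_value_domains(
    toy, binary_cols=["sex_0male_1female", "hospital_outcome_1alive_0dead"]
)
print(report)
```

On the notebook's data the same helper could be called with `primary` in place of `toy`; a report of all zeros would indicate that no value falls outside the assumed domains.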


### Detection of missing values

In [4]:

# Count missing values per column and keep only the columns that have any
missing = primary.isnull().sum()
missing = missing[missing > 0]
missing


Series([], dtype: int64)

No missing values were detected in the primary cohort dataset.

### Detection of duplicated rows

In [5]:
n_duplicates = primary.duplicated().sum()
print(f"Number of duplicated rows: {n_duplicates}")

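Number of duplicated rows: 108693

Only 1,511 of the 110,204 rows are distinct (110204 - 108693 = 1511). The dataset has four low-cardinality columns and no patient identifier, so repeated rows are expected. They most likely correspond to different records that share the same age, sex, episode number, and outcome, and should not be dropped as errors without further evidence.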
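
To see how the repeated rows are distributed, one option is to count how often each distinct row occurs. The sketch below uses a small made-up frame with two of the cohort's column names; `toy` and `row_counts` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Made-up rows: one combination appears three times, two appear once
toy = pd.DataFrame({
    "age_years": [21, 21, 21, 77, 72],
    "sex_0male_1female": [1, 1, 1, 0, 0],
})

# duplicated() flags every repeat of a row after its first occurrence
n_duplicates = int(toy.duplicated().sum())

# Distinct rows = total rows minus the flagged repeats
n_unique = len(toy) - n_duplicates

# Size of each group of identical rows, largest first
row_counts = (
    toy.groupby(list(toy.columns))
    .size()
    .sort_values(ascending=False)
)

print(n_duplicates, n_unique)
print(row_counts.head())
```

Applied to the notebook's data, the same grouping would be `primary.groupby(list(primary.columns)).size()`, which shows which value combinations account for most of the repeats.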


### Target variable analysis

In [6]:
# A notebook cell displays only its last expression, so show the class proportions directly
primary['hospital_outcome_1alive_0dead'].value_counts(normalize=True)

hospital_outcome_1alive_0dead
1    0.926455
0    0.073545
Name: proportion, dtype: float64

The target variable is imbalanced: about 92.6% of records have outcome 1 (alive) and 7.4% have outcome 0 (dead). This imbalance may affect model performance and will require attention during modeling.
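
One common way to give the imbalance that attention is to weight each class inversely to its frequency. The sketch below computes such weights with the `n_samples / (n_classes * class_count)` heuristic on made-up labels. It illustrates one possible option and is not the approach chosen by this project.

```python
import pandas as pd

# Made-up labels: 9 records with outcome 1 (alive) and 1 with outcome 0 (dead)
y = pd.Series([1] * 9 + [0], name="hospital_outcome_1alive_0dead")

# Number of records per class
counts = y.value_counts()

# Inverse-frequency weights: rarer classes receive larger weights
class_weights = (len(y) / (len(counts) * counts)).to_dict()

print(class_weights)
```

With these toy labels the minority class receives a weight ten times larger than the majority class's. Many estimators accept a mapping of this shape, though whether weighting, resampling, or threshold tuning fits best is left to the modeling phase.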

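### Conclusion

The dataset has 110,204 rows and four `int64` columns with no missing values, so it is
structurally consistent and can be used for downstream data cleaning, feature engineering,
and modeling. The large number of repeated rows and the class imbalance in the target
should be taken into account in those steps.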
