# **DATA CLEANING AND ENCODING**

## Objectives

- Clean and preprocess the discovery and validation clinical cohorts  
- Handle duplicates, missing values, and clinically relevant categorical data  
- Encode categorical variables for further analysis and modeling

## Inputs

- `data/clinical_data_discovery_cohort.csv`  
- `data/clinical_data_validation_cohort.xlsx`

## Outputs

- Cleaned and encoded datasets saved to `data/cleaned/` (relative to the project root)  
- Ready for exploratory data analysis (Notebook 2)

## Additional Comments

* NA



---

# Change working directory

The working directory is changed from the notebook's folder to its parent (the project root) so that relative paths such as `data/...` resolve correctly.
* os.getcwd() returns the current working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\petal\\Downloads\\CI-DBC\\vscode-projects\\clinical-survival-analysis\\jupyter_notebooks'

Make the parent of the current directory the new working directory
* os.path.dirname() returns the parent directory of a path
* os.chdir() sets the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\petal\\Downloads\\CI-DBC\\vscode-projects\\clinical-survival-analysis'
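
Re-running the cell that calls `os.chdir(os.path.dirname(current_dir))` moves up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path above; the helper name `project_root` is hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path

def project_root(start: Path, notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` only if `start` is the notebooks folder.

    `notebooks_dirname` is an assumption taken from the path printed above.
    """
    return start.parent if start.name == notebooks_dirname else start

root = project_root(Path.cwd())
os.chdir(root)  # leaves the directory unchanged when already at the project root
```

With this guard, running the cell a second time from the project root does not climb further up the directory tree.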

# Section 1

# Load and Quick Check

In [4]:
import pandas as pd
import numpy as np

# Load data
discovery = pd.read_csv("data/clinical_data_discovery_cohort.csv")
validation = pd.read_excel("data/clinical_data_validation_cohort.xlsx")

print("Discovery shape:", discovery.shape)
print("Validation shape:", validation.shape)


Discovery shape: (30, 10)
Validation shape: (95, 14)
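
Shapes alone do not show which columns have gaps or unexpected types. The following is a rough sketch of a slightly richer quick check; the helper `quick_check` and the toy values are illustrative assumptions, not the cohort data:

```python
import pandas as pd

def quick_check(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame standing in for a cohort (values are made up for illustration)
toy = pd.DataFrame({
    "Patient ID": ["P1", "P2", "P3"],
    "Age": [61, 70, None],
    "EGFR": ["Negative", None, None],
})
summary = quick_check(toy)
print(summary)
```

Calling `quick_check(discovery)` and `quick_check(validation)` would give the same overview for the real cohorts.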


# Handle Duplicates

In [5]:
print("Discovery duplicates:", discovery.duplicated().sum())
print("Validation duplicates:", validation.duplicated().sum())

# Drop if any duplicates found
discovery.drop_duplicates(inplace=True)
validation.drop_duplicates(inplace=True)


Discovery duplicates: 0
Validation duplicates: 0
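
`duplicated()` with no arguments flags only rows that match in every column, so a patient entered twice with slightly different values would not be counted. A small sketch of an ID-based check, assuming an identifier column such as `Patient ID` (present in the validation cohort; the toy values below are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient ID": ["P1", "P2", "P2", "P3"],
    "Age": [61, 70, 71, 55],  # P2 appears twice with different ages
})

# Full-row duplicates: identical in every column
full_row_dups = toy.duplicated().sum()

# ID-based duplicates: same patient identifier, regardless of other columns
id_dups = toy.duplicated(subset=["Patient ID"]).sum()

# keep=False marks every row of a repeated ID, useful for manual review
repeated = toy[toy.duplicated(subset=["Patient ID"], keep=False)]
print(full_row_dups, id_dups)
print(repeated)
```

Whether repeated IDs should be dropped or reconciled is a clinical decision, so this sketch only reports them.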


# Handle Missing / Special Cases

In [6]:
# Clinical note: 'Type.Adjuvant' NaN may indicate NO adjuvant therapy
validation['Type.Adjuvant'] = validation['Type.Adjuvant'].fillna("No_Adjuvant_Therapy")

# EGFR / KRAS: label missing results explicitly as 'Unknown' (alternative: keep NaN)
validation['EGFR'] = validation['EGFR'].fillna("Unknown")
validation['KRAS'] = validation['KRAS'].fillna("Unknown")

# Confirm missing values after handling
print("Remaining missing values per column:")
print(validation.isnull().sum())


Remaining missing values per column:
Patient ID                    0
Survival time (days)          0
Event (death: 1, alive: 0)    0
Tumor size (cm)               0
Grade                         0
Stage (TNM 8th edition)       0
Age                           0
Sex                           0
Cigarette                     0
Pack per year                 0
Type.Adjuvant                 0
batch                         0
EGFR                          0
KRAS                          0
dtype: int64


> **Clinical Note:**  
> The `Type.Adjuvant` column contains many NaN entries.  
> In oncology, not receiving adjuvant therapy can be clinically valid (early-stage cancer, complete resection, or patient preference), so a NaN here may mean "no therapy given" rather than "not recorded".  
> On that assumption, these values were encoded as `"No_Adjuvant_Therapy"` to preserve clinical meaning instead of being treated as missing.
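
Because the recoding rests on an assumption, it can help to record which rows were originally NaN before filling them. A minimal sketch on toy data; the flag column name `Adjuvant_was_missing` and the `"Chemo"` label are hypothetical and not taken from the cohort:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Type.Adjuvant": ["Chemo", np.nan, np.nan, "Chemo"]})

# Count missing entries before they are overwritten
n_missing_before = toy["Type.Adjuvant"].isna().sum()

# Keep a flag so the assumption can be revisited later
toy["Adjuvant_was_missing"] = toy["Type.Adjuvant"].isna()

toy["Type.Adjuvant"] = toy["Type.Adjuvant"].fillna("No_Adjuvant_Therapy")
print(n_missing_before)
print(toy)
```

The flag makes it possible to rerun later analyses with and without the recoded patients.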


---

# Section 2

# Convert Data Types

In [7]:
# Convert date columns in discovery cohort
date_cols = ['Specimen date', 'Date of Death', 'Date of Last Follow Up']
for col in date_cols:
    discovery[col] = pd.to_datetime(discovery[col], errors='coerce')

# Convert numeric columns in validation cohort
validation['Survival time (days)'] = pd.to_numeric(validation['Survival time (days)'], errors='coerce')
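
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT` (dates) or `NaN` (numbers). The sketch below, on invented values, shows one way to count how many entries were lost to coercion as opposed to being missing from the start:

```python
import pandas as pd

raw_dates = pd.Series(["2015-03-01", "not recorded", None])
parsed_dates = pd.to_datetime(raw_dates, errors="coerce")

# Present in the raw data but missing after parsing: a coercion casualty
dates_lost = (parsed_dates.isna() & raw_dates.notna()).sum()

raw_days = pd.Series(["120", "n/a", "365"])
parsed_days = pd.to_numeric(raw_days, errors="coerce")
days_lost = (parsed_days.isna() & raw_days.notna()).sum()

print(dates_lost, days_lost)
```

A non-zero count suggests inspecting the raw values before relying on the converted column.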


# Encode Categorical Variables for Modeling

In [8]:

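# 1. Create simplified mutation status columns for KRAS and EGFR
# Missing values were already filled with the string "Unknown" above, so that
# label is tested explicitly: pd.notnull("Unknown") is True and would otherwise
# classify unknown results as 'Mutated'.
def simplify_mutation_status(x):
    if pd.isnull(x) or x == 'Unknown':
        return 'Unknown'
    return 'Wild-type' if x == 'Negative' else 'Mutated'

validation['KRAS_status'] = validation['KRAS'].apply(simplify_mutation_status)
validation['EGFR_status'] = validation['EGFR'].apply(simplify_mutation_status)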

# 2. Define categorical columns for encoding
categorical_cols_disc = ['sex', 'race', 'Stage']
categorical_cols_val = [
    'Stage (TNM 8th edition)',
    'Sex',
    'Type.Adjuvant',
    'EGFR_status',   # Simplified for non-clinical clarity
    'KRAS_status',   # Simplified for non-clinical clarity
    'Cigarette'
]

# 3. Encode with pandas.get_dummies for ML readiness
disc_encoded = pd.get_dummies(discovery, columns=categorical_cols_disc, drop_first=True)
val_encoded = pd.get_dummies(validation, columns=categorical_cols_val, drop_first=True)

# 4. Save to cleaned folder (create it first if it does not exist)
os.makedirs("data/cleaned", exist_ok=True)
disc_encoded.to_csv("data/cleaned/discovery_clean.csv", index=False)
val_encoded.to_csv("data/cleaned/validation_clean.csv", index=False)

print("Encoding complete. Cleaned datasets saved to /data/cleaned/")



Encoding complete. Cleaned datasets saved to /data/cleaned/
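
To illustrate what `drop_first=True` does to the encoded columns, here is a small sketch on toy data (the `Sex` labels are assumed for illustration; the status labels match those created above). The first category in sorted order is dropped and acts as the baseline:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "KRAS_status": ["Mutated", "Unknown", "Wild-type"],
})

encoded = pd.get_dummies(toy, columns=["Sex", "KRAS_status"], drop_first=True)

# 'Female' and 'Mutated' sort first, so they become the implicit baselines
print(list(encoded.columns))
print(encoded)
```

A row with all indicator columns for a variable set to false therefore belongs to the dropped baseline category, which matters when interpreting model coefficients later.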


### Observations
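- No duplicate rows were found in either cohort, so none were removed
- `Type.Adjuvant` NaNs converted to `"No_Adjuvant_Therapy"`, on the assumption that they indicate no adjuvant therapy
- EGFR and KRAS missing values replaced with `"Unknown"`
- Date columns in the discovery cohort converted to datetime; `Survival time (days)` in the validation cohort converted to numeric
- Cleaned and encoded datasets saved to `data/cleaned/`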
- Ready for exploratory analysis in Notebook 2
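
One point worth keeping in mind for Notebook 2: CSV files store datetimes as text, so the datetime dtype is not preserved on reload unless the columns are parsed again. A sketch using an in-memory buffer in place of the real file, with invented dates:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Specimen date": pd.to_datetime(["2015-03-01", "2016-07-15"])})

# Write to an in-memory buffer instead of data/cleaned/ for this illustration
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dates come back as plain text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Specimen date"])  # dates restored

print(plain.dtypes)
print(parsed.dtypes)
```

Passing `parse_dates` with the discovery cohort's date columns when reading `discovery_clean.csv` would restore them in the same way.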


---

# Next Steps

* Further exploratory analysis in Notebook 2