# Step 2 — Data Cleaning (SHP, Cross-Section)

This notebook loads the Step-1 dataset (`analysis_dataset_step1.csv`), applies the agreed cleaning rules, and exports a cleaned dataset (`analysis_dataset_step2.csv`).

**Key actions**
- Drop variables with 100% missing values  
- Drop selected variables with high missingness / low relevance  
- Restrict sample to adults (age ≥ 18)  
- Keep an explicit set of analysis variables  
- Export cleaned CSV  


In [None]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)


## 1) Load Step-1 dataset

If your CSV is in the same folder as this notebook, you can keep the default path.
Otherwise, adjust `DATA_PATH`.


In [None]:
DATA_PATH = "analysis_dataset_step1.csv"  # adjust if needed

df = pd.read_csv(DATA_PATH)

print("Initial shape:", df.shape)
display(df.head())
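
If the path is wrong, `pd.read_csv` fails with a fairly generic error. A small wrapper can fail earlier with the resolved path in the message. This is a minimal sketch: `load_step1` is a hypothetical helper name, and the demo file and its two columns are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_step1(path):
    """Read the Step-1 CSV, raising a clear error if the file is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Step-1 dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Demo on a throwaway file (toy data, not the real Step-1 export)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_step1.csv"
    pd.DataFrame({"idpers": [1, 2], "age17": [34, 17]}).to_csv(demo_path, index=False)
    demo = load_step1(demo_path)

print("Demo shape:", demo.shape)
```

In the notebook itself, the call would be `df = load_step1(DATA_PATH)`.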


## 2) Drop variables with 100% missing values

These variables had *no valid observations* in the Step-1 export.


In [None]:
vars_drop_all_missing = [
    "p17a01",
    "p17c01",
    "occupa17",
    "sex17",
]

# errors="ignore" skips any listed column that is not present instead of raising
df = df.drop(columns=vars_drop_all_missing, errors="ignore")

print("After dropping 100% missing variables:", df.shape)
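
Because the list above is hard-coded, it can be worth confirming, just before the drop, that each listed column really is empty in the file at hand. The sketch below runs on a toy frame (the values are invented; only the column names come from this notebook) and sorts candidate names into "entirely missing", "has data after all", and "not in the frame".

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Step-1 data (values are illustrative only)
toy = pd.DataFrame({
    "idpers": [1, 2, 3],
    "p17a01": [np.nan, np.nan, np.nan],  # entirely missing
    "age17": [25, np.nan, 61],           # only partially missing
})

candidates = ["p17a01", "age17", "sex17"]  # "sex17" is absent from the toy frame

present = [c for c in candidates if c in toy.columns]
all_missing = [c for c in present if toy[c].isna().all()]
not_all_missing = [c for c in present if c not in all_missing]
absent = [c for c in candidates if c not in toy.columns]

print("Entirely missing:", all_missing)
print("Contain valid values:", not_all_missing)
print("Not in the frame:", absent)
```

Any name that lands in the second list would lose real observations if dropped under this section's rule.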


## 3) Drop selected variables with high missingness / low analytical relevance

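As decided earlier, these variables are dropped because of their high share of missing values and low analytical relevance, which improves the stability and interpretability of the later analyses.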


In [None]:
vars_drop_partial_missing = [
    "p17a04",
    "p17c02",
    "p17c08",
]

df = df.drop(columns=vars_drop_partial_missing, errors="ignore")

print("After dropping variables with high missingness:", df.shape)
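
To document how incomplete these variables are, the share of missing values can be computed before they are removed. The sketch below uses a toy frame with invented values. The cutoff `THRESHOLD = 0.6` is purely hypothetical, since the notebook does not state which level of missingness led to the decision.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "p17a04": [1.0, np.nan, np.nan, np.nan],
    "p17c02": [np.nan, 2.0, 3.0, np.nan],
    "age17": [25, 40, 61, 19],
})

THRESHOLD = 0.6  # hypothetical cutoff, not taken from the notebook

# Share of NaN values per candidate column
share = toy[["p17a04", "p17c02"]].isna().mean()
flagged = share[share >= THRESHOLD].index.tolist()

print(share)
print("At or above threshold:", flagged)
```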


## 4) Sample restriction: adults only (age ≥ 18)

This defines a clear analytical population for a cross-sectional analysis.


In [None]:
# Keep only adults; rows with a missing age17 fail the comparison and are dropped as well
df = df[df["age17"] >= 18].copy()

print("After age restriction (18+):", df.shape)
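
The filter removes more than minors: a row with a missing `age17` also fails `>= 18`. Any negative missing-value code would be removed for the same reason. Whether the Step-1 export still contains such codes is an assumption here, as the notebook does not say. The sketch below, on invented toy data, counts each reason separately so the restriction can be reported transparently.

```python
import numpy as np
import pandas as pd

# Toy data: one minor, two adults, one NaN, one negative code (assumed convention)
toy = pd.DataFrame({
    "idpers": [1, 2, 3, 4, 5],
    "age17": [17, 18, 45, np.nan, -3],
})

age = pd.to_numeric(toy["age17"], errors="coerce")
is_adult = age >= 18  # NaN compares as False

report = {
    "kept_adults": int(is_adult.sum()),
    "dropped_under_18": int(((age >= 0) & (age < 18)).sum()),
    "dropped_missing_age": int(age.isna().sum()),
    "dropped_negative_code": int((age < 0).sum()),
}
adults = toy[is_adult].copy()

print(report)
```

The four counts add up to the number of input rows, which makes the breakdown easy to check.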


## 5) Keep an explicit set of analysis variables

This ensures a clean, well-defined dataset for Step 3 (recoding + descriptives + regressions).


In [None]:
vars_keep = [
    "idpers",
    "idhous17",
    "age17",
    "nationality17",
    "edyear17",
    "isced17",
    "income17",
    "nbpers17",
    "nbkid17",
    "sport17",
    "health17",
    "x17i04",
]

# Keep only columns that exist (safe if you rerun with slightly different Step-1 versions)
vars_keep_existing = [c for c in vars_keep if c in df.columns]
missing_cols = sorted(set(vars_keep) - set(vars_keep_existing))

df_step2 = df[vars_keep_existing].copy()

print("Final dataset shape:", df_step2.shape)
if missing_cols:
    print("WARNING: These expected columns were not found and were skipped:", missing_cols)

df_step2.info()
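
For a cross-section, each person should appear once. Assuming `idpers` is the person identifier and is meant to be unique per row (the notebook does not state this explicitly), a duplicate check along the following lines could be added. The toy values are invented.

```python
import pandas as pd

# Toy frame with one repeated person ID (illustrative values)
toy = pd.DataFrame({
    "idpers": [101, 102, 102, 103],
    "idhous17": [1, 1, 1, 2],
})

# Number of rows that repeat an earlier idpers
n_dupes = int(toy["idpers"].duplicated().sum())

# All rows involved in a duplication, for inspection
dupe_rows = toy[toy["idpers"].duplicated(keep=False)]

print("Duplicate person IDs:", n_dupes)
print(dupe_rows)
```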


## 6) Missing-value overview (optional but recommended)

The cell below shows the share of missing values (NaN) per variable, sorted from highest to lowest.


In [None]:
missing_share = df_step2.isnull().mean().sort_values(ascending=False)
display(missing_share)
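
`isnull()` only counts NaN. If the data still carry negative missing-value codes (an assumption, since it depends on how Step 1 exported the file), those would be treated as valid values. The sketch below reports both shares side by side on invented toy data. It only makes sense for variables where a negative value cannot be a real answer.

```python
import numpy as np
import pandas as pd

# Toy data mixing NaN and negative codes (assumed convention, invented values)
toy = pd.DataFrame({
    "health17": [1, -1, np.nan, 3],
    "income17": [50000, -2, -3, 72000],
})

overview = pd.DataFrame({
    "nan_share": toy.isna().mean(),
    "negative_code_share": (toy < 0).mean(),  # NaN compares as False
})

print(overview)
```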


## 7) Export cleaned dataset


In [None]:
OUTPUT_PATH = "analysis_dataset_step2.csv"
df_step2.to_csv(OUTPUT_PATH, index=False)

print(f"✅ Cleaned dataset saved as: {OUTPUT_PATH}")
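
A quick round-trip check confirms that the exported file can be read back with the same shape and column order. The sketch below writes a toy frame to a temporary directory rather than touching the real output. The file name and values are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for df_step2 (invented values)
toy = pd.DataFrame({
    "idpers": [1, 2],
    "age17": [34, 52],
    "health17": [2.0, None],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_step2.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

In the notebook, the same comparison could be made between `df_step2` and `pd.read_csv(OUTPUT_PATH)`.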
