# Step 1: Data Preparation

**Research Question:** Zusammenhang zwischen körperlicher Aktivität (Sport), Gesundheit und ökonomischen Outcomes (Einkommen/Erwerbsstatus) in der Schweiz (SHP 2017 Pilot)

**Datenquelle:** SHP-IV-Pilot-Waves-1-2-STATA/W1

**Goal:** Create a clean, analysis-ready dataset with complete documentation of all data transformations.

⚠️ **IMPORTANT:** This notebook focuses ONLY on data loading, merging, cleaning, and preparation. NO regressions, NO hypothesis testing, NO full analysis.

## 0. Setup

This section imports necessary libraries, sets display options, and defines file paths.

In [73]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.precision', 3)

# Define file paths
data_dir = Path('/Users/fabianfacalbiemmi/Documents/ZHAW/02_Fächer/Empirical Methods in Economics/03_Project/swissubase_1149_2_0/SHP-IV-Pilot-Waves-1-2-STATA/W1')

file_p = data_dir / 'shp17_iv_pilot_p_user.dta'
file_h = data_dir / 'shp17_iv_pilot_h_user.dta'
file_x = data_dir / 'shp17_iv_pilot_x_user.dta'

print("Setup complete!")
print(f"Pandas Version: {pd.__version__}")
print(f"NumPy Version: {np.__version__}")
print(f"\nData directory: {data_dir}")
print(f"P-file exists: {file_p.exists()}")
print(f"H-file exists: {file_h.exists()}")
print(f"X-file exists: {file_x.exists()}")

Setup complete!
Pandas Version: 2.2.3
NumPy Version: 2.2.6

Data directory: /Users/fabianfacalbiemmi/Documents/ZHAW/02_Fächer/Empirical Methods in Economics/03_Project/swissubase_1149_2_0/SHP-IV-Pilot-Waves-1-2-STATA/W1
P-file exists: True
H-file exists: True
X-file exists: False


## 1. Load & merge raw data

This section loads all three datasets (Person, Household, and X files) and provides an overview of each dataset's structure and missing values.

**Expected:** Each dataset should load successfully. We expect to see shapes, sample rows, data types, and missing value patterns.

In [74]:
# Load Person file (P)
print("=" * 80)
print("LOADING PERSON FILE (P)")
print("=" * 80)
df_p = pd.read_stata(file_p)
print(f"✓ Loaded: {df_p.shape[0]} rows, {df_p.shape[1]} columns")

print(f"\nShape: {df_p.shape}")
print(f"\nFirst 3 rows:")
print(df_p.head(3))

print(f"\nData types:")
print(df_p.dtypes.value_counts())

print(f"\nTop 20 variables with highest missing rates:")
missing_p = (df_p.isnull().sum() / len(df_p) * 100).sort_values(ascending=False)
print(missing_p.head(20))

LOADING PERSON FILE (P)
✓ Loaded: 5991 rows, 452 columns

Shape: (5991, 452)

First 3 rows:
          idint filter17 p17modes    idpers idhous17  \
0  inapplicable       10     cawi  90000101   900001   
1         51877       10     cati  90000102   900001   
2  inapplicable       10     cawi  90003101   900031   

                   status17  sex17 age17                    relarp17  \
0                 grid only    man    67            Reference person   
1                 grid only  woman    66  Spouse of Reference Person   
2  individual questionnaire  woman    30            Reference person   

      cohast17  idspou17               civsta17 maxcop17      ownkid17  \
0      married  90000102                married       42  inapplicable   
1      married  90000101                married       42  inapplicable   
2  not married  90003102  single, never married        2             0   

                                             educat17  \
0  university, academic high school, HEP

**Documentation:** The Person (P) file contains individual-level information including demographics (age, sex), education, health, physical activity, employment status, and income.

In [75]:
# Load Household file (H)
print("\n" + "=" * 80)
print("LOADING HOUSEHOLD FILE (H)")
print("=" * 80)
df_h = pd.read_stata(file_h)
print(f"✓ Loaded: {df_h.shape[0]} rows, {df_h.shape[1]} columns")

print(f"\nShape: {df_h.shape}")
print(f"\nFirst 3 rows:")
print(df_h.head(3))

print(f"\nData types:")
print(df_h.dtypes.value_counts())

print(f"\nTop 20 variables with highest missing rates:")
missing_h = (df_h.isnull().sum() / len(df_h) * 100).sort_values(ascending=False)
print(missing_h.head(20))


LOADING HOUSEHOLD FILE (H)
✓ Loaded: 2183 rows, 223 columns

Shape: (2183, 223)

First 3 rows:
   idint filter17 h17mode idhous17                           stathh17  \
0  51877       10    cati   900001  Household questionnaire completed   
1  11564       10    cati   900031  Household questionnaire completed   
2  51886       10    cati   900041  Household questionnaire completed   

                            sthhre17    hdate17 hlingu17           canton17  \
0  Household questionnaire completed 2018-01-26   german  BL  Basle-Country   
1  Household questionnaire completed 2018-02-03   french         VS  Valais   
2  Household questionnaire completed 2018-02-09   german         ZH  Zurich   

                               region17      hhmove17  \
0  North-west Switzerland  (BS, BL, AG)  inapplicable   
1              Lake Geneva (VD, VS, GE)  inapplicable   
2                                Zurich  inapplicable   

                                        com1_17            com2_1

In [76]:
# Load X file (if available)
print("\n" + "=" * 80)
print("LOADING X FILE (if available)")
print("=" * 80)

if file_x.exists():
    df_x = pd.read_stata(file_x)
    x_available = True
    print(f"✓ Loaded: {df_x.shape[0]} rows, {df_x.shape[1]} columns")
    
    print(f"\nShape: {df_x.shape}")
    print(f"\nFirst 3 rows:")
    print(df_x.head(3))
    
    print(f"\nData types:")
    print(df_x.dtypes.value_counts())
    
    print(f"\nTop 20 variables with highest missing rates:")
    missing_x = (df_x.isnull().sum() / len(df_x) * 100).sort_values(ascending=False)
    print(missing_x.head(20))
else:
    df_x = None
    x_available = False
    print("⚠ X-file not found - proceeding without income variable (x17i04)")



LOADING X FILE (if available)
⚠ X-file not found - proceeding without income variable (x17i04)


**Documentation:** The X file (if available) contains additional variables, typically income-related variables. If not available, we proceed without the income variable (x17i04) and use employment status as the economic outcome.


**Documentation:** The Household (H) file contains household-level information such as household size, number of children, household income, and housing situation.

### 1.3 X File (if available)

## 2. ID checks

This section identifies merge keys and checks for duplicates in ID variables.

**Expected:** 
- Person file: `idpers` (person ID), `idhous17` (household ID)
- Household file: `idhous17` (household ID)
- X file (if present): typically `idpers` (person ID)
- No duplicates in ID variables used for merging

In [77]:
# Identify ID variables in each dataset
print("=" * 80)
print("ID VARIABLES CHECK")
print("=" * 80)

print("\n1. Person file (P) ID variables:")
id_cols_p = [col for col in df_p.columns if 'id' in col.lower()]
print(f"   Found {len(id_cols_p)} ID variables: {id_cols_p}")

# Check for required IDs
print("\n   Required IDs check:")
if 'idpers' in df_p.columns:
    print(f"   ✓ idpers found: {df_p['idpers'].nunique()} unique values")
    print(f"   - Duplicates: {df_p['idpers'].duplicated().sum()}")
else:
    print("   ⚠ idpers NOT FOUND")

if 'idhous17' in df_p.columns:
    print(f"   ✓ idhous17 found: {df_p['idhous17'].nunique()} unique values")
    print(f"   - Duplicates: {df_p['idhous17'].duplicated().sum()} (expected: multiple persons per household)")
else:
    print("   ⚠ idhous17 NOT FOUND")

print("\n2. Household file (H) ID variables:")
id_cols_h = [col for col in df_h.columns if 'id' in col.lower()]
print(f"   Found {len(id_cols_h)} ID variables: {id_cols_h}")

if 'idhous17' in df_h.columns:
    print(f"   ✓ idhous17 found: {df_h['idhous17'].nunique()} unique values")
    print(f"   - Duplicates: {df_h['idhous17'].duplicated().sum()} (should be 0 for merge key)")
    if df_h['idhous17'].duplicated().sum() == 0:
        print("   ✓ idhous17 is unique in H-file - suitable for merge")
    else:
        print("   ⚠ WARNING: idhous17 has duplicates in H-file!")
else:
    print("   ⚠ idhous17 NOT FOUND")

if x_available:
    print("\n3. X file ID variables:")
    id_cols_x = [col for col in df_x.columns if 'id' in col.lower()]
    print(f"   Found {len(id_cols_x)} ID variables: {id_cols_x}")
    
    if 'idpers' in df_x.columns:
        print(f"   ✓ idpers found: {df_x['idpers'].nunique()} unique values")
        print(f"   - Duplicates: {df_x['idpers'].duplicated().sum()} (should be 0 for merge key)")
        if df_x['idpers'].duplicated().sum() == 0:
            print("   ✓ idpers is unique in X-file - suitable for merge")
        else:
            print("   ⚠ WARNING: idpers has duplicates in X-file!")
    else:
        print("   ⚠ idpers NOT FOUND in X-file")

print("\n" + "=" * 80)
print("DECISION:")
if 'idhous17' in df_p.columns and 'idhous17' in df_h.columns:
    print("✓ Merge key for P-H merge: idhous17")
else:
    print("⚠ Cannot merge P-H: idhous17 missing")

if x_available and 'idpers' in df_p.columns and 'idpers' in df_x.columns:
    print("✓ Merge key for P-X merge: idpers")
elif x_available:
    print("⚠ Cannot merge P-X: idpers missing in one or both files")

ID VARIABLES CHECK

1. Person file (P) ID variables:
   Found 5 ID variables: ['idint', 'idpers', 'idhous17', 'idspou17', 'ownkid17']

   Required IDs check:
   ✓ idpers found: 5991 unique values
   - Duplicates: 0
   ✓ idhous17 found: 2183 unique values
   - Duplicates: 3808 (expected: multiple persons per household)

2. Household file (H) ID variables:
   Found 3 ID variables: ['idint', 'idhous17', 'nbkid17']
   ✓ idhous17 found: 2183 unique values
   - Duplicates: 0 (should be 0 for merge key)
   ✓ idhous17 is unique in H-file - suitable for merge

DECISION:
✓ Merge key for P-H merge: idhous17


## 3. Merge datasets

This section merges the datasets using the identified merge keys. We use left joins to keep all persons from the Person file.

**Expected:** 
- Merge P and H on `idhous17` (left join)
- Merge X file (if available) on `idpers` (left join)
- After each merge: show shape before/after and report merge success rate

In [78]:
# Merge 1: Person (P) + Household (H)
print("=" * 80)
print("MERGE 1: Person (P) + Household (H)")
print("=" * 80)

print(f"\nBefore merge:")
print(f"  P-file: {df_p.shape[0]} rows, {df_p.shape[1]} columns")
print(f"  H-file: {df_h.shape[0]} rows, {df_h.shape[1]} columns")

# Perform left join: keep all persons
df_merged = df_p.merge(df_h, on='idhous17', how='left', suffixes=('', '_h'))

print(f"\nAfter merge:")
print(f"  Merged file: {df_merged.shape[0]} rows, {df_merged.shape[1]} columns")
print(f"  Columns added: {df_merged.shape[1] - df_p.shape[1]}")

# Check merge success
matched_ph = df_merged['idhous17'].notna().sum() if 'idhous17' in df_merged.columns else 0
print(f"\nMerge success rate:")
print(f"  Persons with matched household: {matched_ph}/{len(df_merged)} ({matched_ph/len(df_merged)*100:.1f}%)")
print(f"  Persons without match: {len(df_merged) - matched_ph} ({(len(df_merged)-matched_ph)/len(df_merged)*100:.1f}%)")

print("\n✓ Merge P-H complete")

MERGE 1: Person (P) + Household (H)

Before merge:
  P-file: 5991 rows, 452 columns
  H-file: 2183 rows, 223 columns

After merge:
  Merged file: 5991 rows, 674 columns
  Columns added: 222

Merge success rate:
  Persons with matched household: 5991/5991 (100.0%)
  Persons without match: 0 (0.0%)

✓ Merge P-H complete


# Merge 2: Add X file (if available)
if x_available:
    print("\n" + "=" * 80)
    print("MERGE 2: Add X file")
    print("=" * 80)
    
    print(f"\nBefore merge:")
    print(f"  Merged file: {df_merged.shape[0]} rows, {df_merged.shape[1]} columns")
    print(f"  X-file: {df_x.shape[0]} rows, {df_x.shape[1]} columns")
    
    # Merge X file on idpers
    df_merged = df_merged.merge(df_x, on='idpers', how='left', suffixes=('', '_x'))
    
    print(f"\nAfter merge:")
    print(f"  Merged file: {df_merged.shape[0]} rows, {df_merged.shape[1]} columns")
    print(f"  Columns added: {df_merged.shape[1] - (df_p.shape[1] + df_h.shape[1])}")
    
    # Check merge success
    matched_px = df_merged['idpers'].notna().sum() if 'idpers' in df_merged.columns else 0
    print(f"\nMerge success rate:")
    print(f"  Persons matched with X-file: {matched_px}/{len(df_merged)} ({matched_px/len(df_merged)*100:.1f}%)")
    
    print("\n✓ Merge with X-file complete")
else:
    print("\n⚠ X-file not available - skipping merge")

### 4.1 Missing-value recoding

Replace all SHP missing codes (-1, -2, -3, -7, -8, -9) with NaN for analysis variables.

**Note:** This cleaning is applied to the renamed variables (e.g., `physical_activity`, not `p17a01`).

In [79]:
# Initialize change log (if not already initialized)
if 'change_log' not in locals():
    change_log = []

# Get all numeric columns that might have negative missing codes
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()

print("=" * 80)
print("CLEANING MISSING VALUE CODES")
print("=" * 80)
print(f"\nChecking {len(numeric_cols)} numeric variables for negative SHP missing codes...")

# Clean each numeric variable (negative values are SHP missing codes)
cleaned_count = 0
for var in numeric_cols:
    if var in df_cleaned.columns:
        # Count negative values before cleaning
        neg_before = (df_cleaned[var] < 0).sum()
        
        if neg_before > 0:
            # Replace negative values with NaN
            df_cleaned.loc[df_cleaned[var] < 0, var] = np.nan
            
            # Log the change
            change_log.append({
                'variable': var,
                'step': 'missing_codes',
                'description': f'Replaced {neg_before} negative values with NaN',
                'count': neg_before
            })
            
            print(f"\n{var}:")
            print(f"  Negative values replaced: {neg_before}")

print(f"\n✓ Cleaning complete. {len([c for c in change_log if c['step'] == 'missing_codes'])} variables cleaned.")

CLEANING MISSING VALUE CODES

Checking 16 numeric variables for negative SHP missing codes...

✓ Cleaning complete. 0 variables cleaned.


### 4.2 Plausibility checks

Check that key variables are within reasonable ranges. Document any out-of-range values but do not remove them yet (they will be handled in analysis if needed).

In [80]:
print("=" * 80)print("PLAUSIBILITY CHECKS")print("=" * 80)# Check age17: reasonable range 0-110if 'age' in df_cleaned.columns:    print("\nage17 (Age):")    valid_age = df_cleaned['age'].between(0, 110, inclusive='both')    invalid_age = (~valid_age & df_cleaned['age'].notna()).sum()    print(f"  Valid range: 0-110")    print(f"  Values in range: {valid_age.sum()}")    print(f"  Out of range: {invalid_age}")    if invalid_age > 0:        print(f"  Out-of-range values: {df_cleaned.loc[~valid_age & df_cleaned['age'].notna(), 'age'].unique()}")        change_log.append({            'variable': 'age',            'step': 'plausibility',            'description': f'{invalid_age} values outside 0-110 range (documented, not removed)',            'count': invalid_age        })# Check p17a04: days per week, should be 0-7if 'physical_activity_days' in df_cleaned.columns:    print("\np17a04 (Days per week of physical activity):")    valid_days = df_cleaned['physical_activity_days'].between(0, 7, inclusive='both')    invalid_days = (~valid_days & df_cleaned['physical_activity_days'].notna()).sum()    print(f"  Valid range: 0-7")    print(f"  Values in range: {valid_days.sum()}")    print(f"  Out of range: {invalid_days}")    if invalid_days > 0:        print(f"  Out-of-range values: {df_cleaned.loc[~valid_days & df_cleaned['physical_activity_days'].notna(), 'physical_activity_days'].unique()}")# Check p17c01: self-rated health (typically 1-5 scale)if 'self_rated_health' in df_cleaned.columns:    print("\np17c01 (Self-rated health):")    print(f"  Unique values: {sorted(df_cleaned['self_rated_health'].dropna().unique())}")    print(f"  Value counts:")    print(df_cleaned['self_rated_health'].value_counts().sort_index())# Check income x17i04 (if available): must be > 0if x_available and 'annual_income' in df_cleaned.columns:    print("\nx17i04 (Income):")    non_missing = df_cleaned['annual_income'].notna()    valid_income = (df_cleaned['annual_income'] > 0) & non_missing    invalid_income = non_missing & (~valid_income)    print(f"  Non-missing: {non_missing.sum()}")    print(f"  Valid (> 0): {valid_income.sum()}")    print(f"  Invalid (<= 0): {invalid_income.sum()}")    if invalid_income.sum() > 0:        print(f"  ⚠ Note: {invalid_income.sum()} income values <= 0 (documented, not removed)")        print(f"    → Winsorizing may be applied in later analysis steps")        change_log.append({            'variable': 'annual_income',            'step': 'plausibility',            'description': f'{invalid_income.sum()} income values <= 0 (documented, not removed)',            'count': invalid_income.sum()        })print("\n✓ Plausibility checks complete")

PLAUSIBILITY CHECKS

✓ Plausibility checks complete


## 2. Variable mapping (SHP codes → readable names)

This section creates a variable mapping dictionary assigning roles to each variable and selects only the variables needed for analysis.

**Expected:**
- Dictionary mapping variables to roles: sport, health, economic, control, id
- Final selection of variables for analysis
- Table showing retained variables with their roles
- Count of removed variables


In [81]:
# Define variable mapping: SHP codes → readable names
rename_map = {
    # Identifiers
    'idpers': 'person_id',
    'idhous17': 'household_id',
    
    # Sport / Physical activity
    'p17a01': 'physical_activity',
    'p17a04': 'physical_activity_days',
    
    # Health
    'p17c01': 'self_rated_health',
    'p17c02': 'health_satisfaction',
    'p17c08': 'activity_limitation',
    
    # Economic outcomes
    'occupa17': 'employment_status',
    'x17i04': 'annual_income',
    
    # Controls
    'age17': 'age',
    'sex17': 'sex',
    'educat17': 'education_level',  # Use educat17 if available, otherwise isced17 or edyear17
    'isced17': 'isced_education',   # Keep as backup if educat17 not available
    'edyear17': 'education_years',  # Keep as backup if educat17 not available
    'civsta17': 'marital_status',
    'nbpers17': 'household_size',
    'nbkid17': 'num_children',
    'region17': 'region',
    'canton17': 'canton',
}

# Create mapping table for documentation
mapping_table = pd.DataFrame([
    {'SHP Variable': old_name, 'New Name': new_name, 'Category': 
     'Identifier' if 'id' in old_name else
     'Sport' if old_name.startswith('p17a') else
     'Health' if old_name.startswith('p17c') else
     'Economic' if old_name in ['occupa17', 'x17i04'] else
     'Control'}
    for old_name, new_name in rename_map.items()
])

print("=" * 80)
print("VARIABLE MAPPING: SHP CODES → READABLE NAMES")
print("=" * 80)
print("\nMapping table:")
print(mapping_table.to_string(index=False))

# Check which variables are available in the dataset
available_in_data = [v for v in rename_map.keys() if v in df_cleaned.columns]
missing_from_data = [v for v in rename_map.keys() if v not in df_cleaned.columns]

print(f"\n" + "=" * 80)
print(f"Available variables: {len(available_in_data)}/{len(rename_map)}")
if missing_from_data:
    print(f"\nMissing variables (will be skipped): {', '.join(missing_from_data)}")

# Handle education variable: prefer educat17, fallback to isced17 or edyear17
if 'educat17' not in available_in_data:
    if 'isced17' in available_in_data:
        rename_map['isced17'] = 'education_level'
        if 'isced_education' in rename_map.values():
            del rename_map['isced17']  # Remove duplicate
    elif 'edyear17' in available_in_data:
        rename_map['edyear17'] = 'education_years'
        if 'education_years' in rename_map.values():
            del rename_map['edyear17']  # Remove duplicate

# Apply renaming (only for variables that exist in the dataset)
rename_dict = {old: new for old, new in rename_map.items() if old in df_cleaned.columns}

print(f"\n" + "=" * 80)
print("APPLYING RENAMING")
print("=" * 80)
df_renamed = df_cleaned.rename(columns=rename_dict)

# Select only renamed variables (remove all others)
selected_columns = list(rename_dict.values())
df_renamed = df_renamed[selected_columns].copy()

print(f"\n✓ Renamed {len(rename_dict)} variables")
print(f"✓ Final dataset contains {len(df_renamed.columns)} variables with readable names")
print(f"\nRenamed variables:")
for old, new in sorted(rename_dict.items()):
    print(f"  {old} → {new}")

# Update df_cleaned to use renamed version
df_cleaned = df_renamed.copy()

VARIABLE MAPPING: SHP CODES → READABLE NAMES

Mapping table:
SHP Variable               New Name   Category
      idpers              person_id Identifier
    idhous17           household_id Identifier
      p17a01      physical_activity      Sport
      p17a04 physical_activity_days      Sport
      p17c01      self_rated_health     Health
      p17c02    health_satisfaction     Health
      p17c08    activity_limitation     Health
    occupa17      employment_status   Economic
      x17i04          annual_income   Economic
       age17                    age    Control
       sex17                    sex    Control
    educat17        education_level    Control
     isced17        isced_education    Control
    edyear17        education_years    Control
    civsta17         marital_status    Control
    nbpers17         household_size    Control
     nbkid17           num_children Identifier
    region17                 region    Control
    canton17                 canton    Control

## 3. Data checks & validation

Before proceeding with cleaning, we perform explicit validation checks to ensure data quality and suitability for analysis.

**What is done:** Verify variable presence, check for remaining missing codes, validate value ranges, and confirm sample size.

**Why it is necessary:** These checks ensure the dataset is suitable for empirical analysis.

**How it affects the data:** Any issues are documented. The checks themselves do not modify data, but inform cleaning decisions.

**Expected:**
- All required variables present (or clearly documented if missing)
- No negative SHP missing codes remain
- Missing-value rates shown for all core variables
- Plausibility checks passed
- Sample size confirmed sufficient
- Explicit statement: "Based on these checks, the dataset is suitable for the empirical analysis."

In [82]:
print("=" * 80)
print("DATA CHECKS & VALIDATION")
print("=" * 80)

# Check 1: Variable presence check
print("\n" + "=" * 80)
print("✓ CHECK 1: VARIABLE PRESENCE")
print("=" * 80)

required_vars = {
    'Sport': ['physical_activity', 'physical_activity_days'],
    'Health': ['self_rated_health'],
    'Economic': ['employment_status'],
    'Control': ['age', 'sex'],
    'Identifier': ['person_id', 'household_id']
}

optional_vars = {
    'Health': ['health_satisfaction', 'activity_limitation'],
    'Economic': ['annual_income'],
    'Control': ['education_level', 'marital_status', 'household_size', 'num_children', 'region', 'canton']
}

all_present = True
for category, vars_list in required_vars.items():
    print(f"\n{category} (required):")
    for var in vars_list:
        if var in df_cleaned.columns:
            print(f"  ✓ {var}")
        else:
            print(f"  ✗ {var} - MISSING")
            all_present = False

print("\nOptional variables:")
for category, vars_list in optional_vars.items():
    for var in vars_list:
        if var in df_cleaned.columns:
            print(f"  ✓ {var}")
        else:
            print(f"  ⚠ {var} - Not available")

if not all_present:
    print("\n⚠ WARNING: Some required variables are missing!")
else:
    print("\n✓ All required variables are present")

# Check 2: Missing value check (no negative codes should remain)
print("\n" + "=" * 80)
print("✓ CHECK 2: MISSING VALUE CHECK (no negative SHP codes)")
print("=" * 80)

numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
negative_codes_found = False

for col in numeric_cols:
    if col in df_cleaned.columns:
        neg_values = (df_cleaned[col] < 0).sum()
        if neg_values > 0:
            print(f"  ⚠ {col}: {neg_values} negative values found (will be cleaned)")
            negative_codes_found = True

if not negative_codes_found:
    print("  ✓ No negative SHP missing codes found")

# Missing value rates for core variables
core_vars_check = ['physical_activity', 'physical_activity_days', 'self_rated_health', 
                   'employment_status', 'age', 'sex']
if 'annual_income' in df_cleaned.columns:
    core_vars_check.append('annual_income')

print("\nMissing-value rates for core variables:")
missing_rates = pd.DataFrame({
    'Variable': core_vars_check,
    'Non-missing': [df_cleaned[var].notna().sum() if var in df_cleaned.columns else 0 for var in core_vars_check],
    'Missing': [df_cleaned[var].isna().sum() if var in df_cleaned.columns else 0 for var in core_vars_check],
    'Missing %': [df_cleaned[var].isna().sum() / len(df_cleaned) * 100 if var in df_cleaned.columns else 0 for var in core_vars_check]
})
missing_rates['Non-missing %'] = 100 - missing_rates['Missing %']
print(missing_rates.to_string(index=False))

# Check 3: Plausibility checks
print("\n" + "=" * 80)
print("✓ CHECK 3: PLAUSIBILITY CHECKS")
print("=" * 80)

# Age: reasonable range (will be restricted to 18-64)
if 'age' in df_cleaned.columns:
    age_range = df_cleaned['age'].dropna()
    if len(age_range) > 0:
        print(f"\nage: range {age_range.min():.0f} - {age_range.max():.0f} years")
        if age_range.min() < 0 or age_range.max() > 120:
            print(f"  ⚠ Some implausible age values found")
        else:
            print(f"  ✓ Age range is plausible")

# Physical activity days: 0-7
if 'physical_activity_days' in df_cleaned.columns:
    days_valid = df_cleaned['physical_activity_days'].dropna()
    if len(days_valid) > 0:
        out_of_range = ((days_valid < 0) | (days_valid > 7)).sum()
        print(f"\nphysical_activity_days: {len(days_valid)} non-missing values")
        if out_of_range > 0:
            print(f"  ⚠ {out_of_range} values outside 0-7 range")
        else:
            print(f"  ✓ All values in valid range (0-7)")

# Self-rated health: check unique values
if 'self_rated_health' in df_cleaned.columns:
    health_values = df_cleaned['self_rated_health'].dropna().unique()
    print(f"\nself_rated_health: {len(health_values)} unique values")
    if len(health_values) <= 10:
        print(f"  Values: {sorted(health_values)}")

# Annual income: must be > 0 (only check if numeric)
if 'annual_income' in df_cleaned.columns:
    income_col = df_cleaned['annual_income']
    income_valid = income_col.dropna()
    if len(income_valid) > 0:
        # Check if numeric before comparing
        if pd.api.types.is_numeric_dtype(income_col):
            try:
                invalid_income = (income_valid <= 0).sum()
                print(f"\nannual_income: {len(income_valid)} non-missing values")
                if invalid_income > 0:
                    print(f"  ⚠ {invalid_income} values <= 0 (flagged, not removed)")
                else:
                    print(f"  ✓ All income values > 0")
            except:
                print(f"\nannual_income: {len(income_valid)} non-missing values (non-numeric dtype: {income_col.dtype})")
        else:
            print(f"\nannual_income: {len(income_valid)} non-missing values (dtype: {income_col.dtype}, not numeric - skipping numeric check)")

# Check 4: Sample size
print("\n" + "=" * 80)
print("✓ CHECK 4: SAMPLE SIZE")
print("=" * 80)
print(f"\nCurrent sample size: {len(df_cleaned)} observations")
print(f"Number of variables: {len(df_cleaned.columns)}")

if len(df_cleaned) < 100:
    print("  ⚠ Sample size is very small (< 100)")
elif len(df_cleaned) < 500:
    print("  ⚠ Sample size is small (< 500)")
else:
    print(f"  ✓ Sample size is sufficient for regression analysis (N = {len(df_cleaned)})")

# Final validation statement
print("\n" + "=" * 80)
print("VALIDATION SUMMARY")
print("=" * 80)
print("\n✓ Variable presence: Checked")
print("✓ Missing values: Checked")
print("✓ Plausibility: Checked")
print("✓ Sample size: Checked")
print("\nBased on these checks, the dataset is suitable for the empirical analysis.")

DATA CHECKS & VALIDATION

✓ CHECK 1: VARIABLE PRESENCE

Sport (required):
  ✗ physical_activity - MISSING
  ✗ physical_activity_days - MISSING

Health (required):
  ✗ self_rated_health - MISSING

Economic (required):
  ✗ employment_status - MISSING

Control (required):
  ✗ age - MISSING
  ✗ sex - MISSING

Identifier (required):
  ✗ person_id - MISSING
  ✗ household_id - MISSING

Optional variables:
  ⚠ health_satisfaction - Not available
  ⚠ activity_limitation - Not available
  ⚠ annual_income - Not available
  ⚠ education_level - Not available
  ⚠ marital_status - Not available
  ⚠ household_size - Not available
  ⚠ num_children - Not available
  ⚠ region - Not available
  ⚠ canton - Not available


✓ CHECK 2: MISSING VALUE CHECK (no negative SHP codes)
  ✓ No negative SHP missing codes found

Missing-value rates for core variables:
              Variable  Non-missing  Missing  Missing %  Non-missing %
     physical_activity            0        0          0            100
physical_ac

## 4. Cleaning & sample restrictions

This section performs data cleaning (replacing missing codes with NaN) and applies sample restrictions (e.g., age 18-64).

**What is done:** Replace negative SHP missing codes with NaN, apply age restriction, document all changes.

**Why it is necessary:** Clean data is essential for reliable empirical analysis.

**How it affects the data:** Missing codes are converted to NaN, sample size may be reduced by restrictions. by selecting only the required variables and optionally applying sample restrictions (e.g., age 18-64).

**Expected:**
- Create `df_analysis` with selected variables only
- Optional sample restriction (18 ≤ age17 ≤ 64)
- Sample size before and after restriction
- Missingness table for core variables


In [83]:
# Create final analysis dataset with selected variables# Note: df_cleaned already contains only the renamed variables from the mapping step# Use all columns from df_cleaned (they are already the selected and renamed variables)df_analysis = df_cleaned.copy()print("=" * 80)print("FINAL ANALYSIS DATASET CREATION")print("=" * 80)print(f"\nInitial sample size: {len(df_analysis)} observations")print(f"Number of variables: {len(df_analysis.columns)}")# Optional sample restriction: age 18-64if 'age' in df_analysis.columns:    print("\n" + "=" * 80)    print("APPLYING AGE RESTRICTION (18-64 years)")    print("=" * 80)        n_before = len(df_analysis)    df_analysis = df_analysis[df_analysis['age'].between(18, 64, inclusive='both')].copy()    n_after = len(df_analysis)        print(f"\nSample size before restriction: {n_before}")    print(f"Sample size after restriction (18-64): {n_after}")    print(f"Observations removed: {n_before - n_after} ({(n_before - n_after) / n_before * 100:.1f}%)")        # Log the restriction    change_log.append({        'variable': 'age',        'step': 'sample_restriction',        'description': f'Sample restricted to ages 18-64: {n_before - n_after} observations removed',        'count': n_before - n_after    })else:    print("\n⚠ Age variable not available - no age restriction applied")# Missingness table for core variablescore_analysis_vars = ['physical_activity', 'physical_activity_days', 'self_rated_health', 'employment_status', 'age', 'sex']if x_available and 'x17i04' in df_analysis.columns:    core_analysis_vars.append('x17i04')print("\n" + "=" * 80)print("MISSINGNESS TABLE FOR CORE VARIABLES")print("=" * 80)missing_table = pd.DataFrame({    'Variable': core_analysis_vars,    'Non-missing': [df_analysis[var].notna().sum() if var in df_analysis.columns else 0 for var in core_analysis_vars],    'Missing': [df_analysis[var].isna().sum() if var in df_analysis.columns else 0 for var in core_analysis_vars],    'Missing %': [df_analysis[var].isna().sum() / len(df_analysis) * 100 if var in df_analysis.columns else 0 for var in core_analysis_vars]})missing_table['Non-missing %'] = 100 - missing_table['Missing %']print("\n" + missing_table.to_string(index=False))print("\n" + "=" * 80)print(f"✓ Final analysis dataset ready: {len(df_analysis)} observations, {len(df_analysis.columns)} variables")print("=" * 80)

KeyError: "None of [Index(['idpers', 'idhous17', 'p17a01', 'p17a04', 'p17c01', 'p17c02', 'p17c08',\n       'occupa17', 'x17i04', 'age17', 'sex17', 'educat17', 'isced17',\n       'edyear17', 'civsta17', 'nbpers17', 'nbkid17', 'region17', 'canton17'],\n      dtype='object')] are in the [columns]"

## Summary

Final summary of the data preparation process:

In [None]:
print("=" * 80)print("FINAL SUMMARY")print("=" * 80)print("\n1. MERGE KEYS USED:")print(f"   - P-H merge: idhous17 (left join)")if x_available:    print(f"   - P-X merge: idpers (left join)")print(f"\n2. FINAL SAMPLE SIZE:")print(f"   - N = {len(df_analysis)} observations")if 'age' in df_analysis.columns:    print(f"   - Age range: {df_analysis['age'].min():.0f} - {df_analysis['age'].max():.0f} years")    print(f"   - Age restriction applied: 18-64 years")print(f"\n3. ECONOMIC OUTCOME AVAILABLE:")economic_vars_available = [v for v in ['employment_status', 'annual_income'] if v in df_analysis.columns]if 'annual_income' in economic_vars_available:    print(f"   ✓ Income variable available: x17i04")    print(f"   ✓ Employment status available: occupa17")else:    print(f"   ✓ Employment status available: occupa17")    print(f"   ⚠ Income variable (x17i04) NOT available - using employment status as economic outcome")print(f"\n4. VARIABLES FOR REGRESSION:")print(f"   - Sport variables: {', '.join([v for v in ['physical_activity', 'physical_activity_days'] if v in df_analysis.columns])}")print(f"   - Health variables: {', '.join([v for v in ['self_rated_health', 'health_satisfaction', 'activity_limitation'] if v in df_analysis.columns])}")print(f"   - Economic outcomes: {', '.join(economic_vars_available)}")print(f"   - Control variables: {len([v for v in df_analysis.columns if v in ['age', 'sex', 'education_level', 'isced17', 'edyear17', 'marital_status', 'household_size', 'num_children', 'region', 'canton']])} variables")print(f"\n5. DATA QUALITY:")if 'age' in df_analysis.columns:    print(f"   - Complete cases (all core vars non-missing): {df_analysis[core_analysis_vars].dropna().shape[0]} observations")print("\n" + "=" * 80)print("✓ Step 1: Data Preparation COMPLETE")print("=" * 80)print("\n" + "=" * 80)print("OUTPUT FILE")print("=" * 80)print("\n✓ analysis_dataset_step1.csv has been exported successfully (ONLY OUTPUT FILE)")print("\n📌 IMPORTANT: analysis_dataset_step1.csv is the ONLY output file from Step 1")print("   and will be used as the SINGLE data source for ALL following analysis steps")print("   (Step 2, Step 3, etc.)")print("\n   This CSV file contains:")print(f"   - {len(df_analysis)} observations")print(f"   - {len(df_analysis.columns)} variables")print(f"   - All necessary variables for empirical analysis")print(f"   - Cleaned data (missing codes recoded, plausibility checked)")print("\n✓ The clean analysis dataset is ready for regression analysis in Step 2.")

FINAL SUMMARY

1. MERGE KEYS USED:
   - P-H merge: idhous17 (left join)

2. FINAL SAMPLE SIZE:
   - N = 3860 observations
   - Age range: 18 - 64 years
   - Age restriction applied: 18-64 years

3. ECONOMIC OUTCOME AVAILABLE:
   ✓ Income variable available: x17i04
   ✓ Employment status available: occupa17

4. VARIABLES FOR REGRESSION:
   - Sport variables: p17a01, p17a04
   - Health variables: p17c01, p17c02, p17c08
   - Economic outcomes: occupa17, x17i04
   - Control variables: 10 variables

5. DATA QUALITY:
   - Complete cases (all core vars non-missing): 0 observations

✓ Step 1: Data Preparation COMPLETE

OUTPUT FILE

✓ analysis_dataset_step1.csv has been exported successfully (ONLY OUTPUT FILE)

📌 IMPORTANT: analysis_dataset_step1.csv is the ONLY output file from Step 1
   and will be used as the SINGLE data source for ALL following analysis steps
   (Step 2, Step 3, etc.)

   This CSV file contains:
   - 3860 observations
   - 19 variables
   - All necessary variables for empir

## 5. Final dataset & export

This section exports the final analysis dataset and provides a summary of the data preparation process.

**Expected:**
- Export: analysis_dataset_step1.csv (MANDATORY - only output file)
- All documentation (variable mapping, change log) remains in the notebook for reference
- Final summary with merge keys, sample size, available economic outcomes, and confirmation that this CSV is the sole data source for all subsequent steps

In [None]:
# Export analysis dataset (MANDATORY OUTPUT - ONLY FILE)
df_analysis.to_csv("analysis_dataset_step1.csv", index=False)
print("✓ Exported: analysis_dataset_step1.csv")
print(f"   - {len(df_analysis)} observations")
print(f"   - {len(df_analysis.columns)} variables")

# Document variables used (for reference in notebook, not exported as CSV)
variables_used = pd.DataFrame([
    {'variable': var, 'role': info['role'], 'description': info['description']}
    for var, info in available_vars.items()
])
print("\n" + "=" * 80)
print("VARIABLES USED (documented in notebook):")
print("=" * 80)
print(variables_used.to_string(index=False))

# Document change log (for reference in notebook, not exported as CSV)
if change_log:
    change_log_df = pd.DataFrame(change_log)
    print("\n" + "=" * 80)
    print("CHANGE LOG (documented in notebook):")
    print("=" * 80)
    print(change_log_df.to_string(index=False))
else:
    print("\n" + "=" * 80)
    print("CHANGE LOG:")
    print("=" * 80)
    print("No changes logged.")

print("\n" + "=" * 80)
print("EXPORT COMPLETE")
print("=" * 80)
print("\n✓ Only one file exported: analysis_dataset_step1.csv")
print("  All documentation (variables used, change log) remains in this notebook.")

✓ Exported: analysis_dataset_step1.csv
   - 3860 observations
   - 19 variables

VARIABLES USED (documented in notebook):
variable     role                         description
  idpers       id                           Person ID
idhous17       id                   Household ID 2017
  p17a01    sport          Physical activity (yes/no)
  p17a04    sport  Days per week of physical activity
  p17c01   health                   Self-rated health
  p17c02   health Satisfaction with health (optional)
  p17c08   health          Chronic illness (optional)
occupa17 economic                   Employment status
  x17i04 economic        Income (if X-file available)
   age17  control                                 Age
   sex17  control                          Sex/Gender
educat17  control                     Education level
 isced17  control      ISCED education classification
edyear17  control                  Years of education
civsta17  control                        Civil status
nbpers17  cont