# Sociopathit Data Utilities - Comprehensive Test Suite

This notebook demonstrates and tests all data utility modules in the sociopathit package.

## 📚 Table of Contents

Click to jump to any section:

1. **[Setup & Initialization](#setup)** - Environment configuration
2. **[File Discovery](#discovery)** - Finding and scanning data files
3. **[Data Loading](#loading)** - Loading surveys and Stata files
4. **[Metadata Utilities](#metadata)** - Survey summaries and structure
5. **[Longitudinal Data](#longitudinal)** - Panel data management
6. **[Harmonization](#preparation)** - Data preparation and cleaning

---

## 🚀 Quick Start

```python
# Example: Load and harmonize multiple survey waves
from sociopathit.data.loading import load_all_surveys
from sociopathit.data.preparation import build_harmonized_dataset

# Load all surveys
surveys = load_all_surveys('data/', target_vars=['age', 'income', 'satisfaction'])

# Build harmonized long-form dataset
df_long = build_harmonized_dataset(surveys, target_vars=['age', 'income', 'satisfaction'])
```

---

<a id='setup'></a>
# 1. Setup & Initialization

Locate the package root and add it to `sys.path` so that the `sociopathit` modules can be imported.

In [1]:
import os
import sys
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import TemporaryDirectory

# Locate and add the package root
cwd = Path.cwd().resolve()
for parent in [cwd, *cwd.parents]:
    if (parent / "sociopathit").exists():
        ROOT = parent
        break
else:
    raise FileNotFoundError("Could not locate the sociopathit package root.")

# Add to sys.path for imports
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print(f"Added to sys.path: {ROOT}")

Added to sys.path: C:\Users\alecw\OneDrive - University of Toronto\Directives\GITTYSBURG\sociopathit


<a id='discovery'></a>
# 2. File Discovery & Path Handling

Test functions for discovering and scanning data files in directory structures.

In [2]:
import importlib
from sociopathit.data import discovery as discovery_module
importlib.reload(discovery_module)
from sociopathit.data.discovery import resolve_qwels_root, discover_data, find_latest_wave

print("Discovery module loaded successfully")

Discovery module loaded successfully


In [3]:
# Test 1: resolve_qwels_root
print("=" * 60)
print("TEST 1: Resolve QWELS Root Directory")
print("=" * 60)

try:
    root = resolve_qwels_root()
    print(f"Found root directory: {root}")
except FileNotFoundError as e:
    print(f"Could not find root (expected if not in QWELS structure): {e}")

TEST 1: Resolve QWELS Root Directory
Found root directory: C:\Users\alecw\OneDrive - University of Toronto\Directives\QWELS


In [4]:
# Test 2: discover_data with simulated data
print("=" * 60)
print("TEST 2: Discover Data Files")
print("=" * 60)

# Create temporary directory with test data
with TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)
    
    # Create some test CSV files
    for year in [2020, 2021, 2022]:
        df_test = pd.DataFrame({
            'pid': range(1, 101),
            'age': np.random.randint(18, 65, 100),
            'income': np.random.randint(30000, 100000, 100),
            'satisfaction': np.random.choice(['Low', 'Medium', 'High'], 100)
        })
        df_test.to_csv(tmpdir / f"survey_{year}.csv", index=False)
    
    # Create a test file (should be excluded)
    df_test.to_csv(tmpdir / "survey_2023_test.csv", index=False)
    
    # Discover data
    data = discover_data(
        tmpdir,
        file_types=['csv'],
        exclude_patterns=['*_test*'],
        recursive=False
    )
    
    print(f"\nDiscovered {len(data)} datasets:")
    for survey_id, df in data.items():
        print(f"  - {survey_id}: {len(df)} rows, {len(df.columns)} columns")
        print(f"    Columns: {list(df.columns)}")

TEST 2: Discover Data Files
✓ Loaded 3 datasets
  File types: csv=3
⚠ Skipped 1 files
  - survey_2023_test.csv: matched exclusion pattern

Discovered 3 datasets:
  - survey_2020: 100 rows, 4 columns
    Columns: ['pid', 'age', 'income', 'satisfaction']
  - survey_2021: 100 rows, 4 columns
    Columns: ['pid', 'age', 'income', 'satisfaction']
  - survey_2022: 100 rows, 4 columns
    Columns: ['pid', 'age', 'income', 'satisfaction']
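
The exclusion behaviour exercised in Test 2 can be approximated with the standard library alone. The sketch below is not the implementation of `discover_data`; `find_csv_files` is a hypothetical helper, and it assumes exclusion patterns are shell-style globs matched against file names.

```python
from fnmatch import fnmatch
from pathlib import Path
from tempfile import TemporaryDirectory


def find_csv_files(directory, exclude_patterns=()):
    """Hypothetical helper: split CSV files into kept and skipped lists."""
    kept, skipped = [], []
    for path in sorted(Path(directory).glob("*.csv")):
        # A file is skipped if its name matches any shell-style pattern
        if any(fnmatch(path.name, pat) for pat in exclude_patterns):
            skipped.append(path.name)
        else:
            kept.append(path.name)
    return kept, skipped


with TemporaryDirectory() as tmp:
    for name in ["survey_2020.csv", "survey_2021.csv", "survey_2023_test.csv"]:
        (Path(tmp) / name).write_text("pid,age\n1,30\n")
    kept, skipped = find_csv_files(tmp, exclude_patterns=["*_test*"])

print(kept)
print(skipped)
```

Returning the skipped names alongside the kept ones makes it easy to report why a file was ignored, as the output above does.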


In [5]:
# Test 3: find_latest_wave
print("=" * 60)
print("TEST 3: Find Latest Wave")
print("=" * 60)

# Create test data with different waves
test_data = {
    'survey_2020': pd.DataFrame({'pid': range(1, 51), 'value': np.random.random(50)}),
    'survey_2021': pd.DataFrame({'pid': range(1, 51), 'value': np.random.random(50)}),
    'survey_2022': pd.DataFrame({'pid': range(1, 51), 'value': np.random.random(50)}),
}

wave_info = find_latest_wave(test_data, name_pattern=r'(\d{4})')

print(f"\nLatest wave: {wave_info['latest'][0]}")
print(f"Earliest wave: {wave_info['earliest'][0]}")
print(f"\nSorted waves:")
for survey_id, year in wave_info['sorted_waves']:
    print(f"  {survey_id} ({year}) -> {wave_info['wave_labels'][survey_id]}")

TEST 3: Find Latest Wave

Latest wave: survey_2022
Earliest wave: survey_2020

Sorted waves:
  survey_2020 (2020) -> wave_1
  survey_2021 (2021) -> wave_2
  survey_2022 (2022) -> wave_3
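
The wave labels above follow from ordering the survey IDs by the year captured by `name_pattern`. A minimal sketch of that idea, assuming a single capture group holding a four-digit year; `label_waves` is a hypothetical helper, not part of the package:

```python
import re


def label_waves(survey_ids, name_pattern=r"(\d{4})"):
    """Hypothetical helper: order survey IDs by the year found in each name."""
    dated = []
    for survey_id in survey_ids:
        match = re.search(name_pattern, survey_id)
        if match:  # IDs without a year are left out of the ordering
            dated.append((survey_id, int(match.group(1))))
    dated.sort(key=lambda pair: pair[1])
    # Wave numbers follow chronological position, starting at 1
    return {sid: f"wave_{i}" for i, (sid, _) in enumerate(dated, start=1)}


labels = label_waves(["survey_2022", "survey_2020", "survey_2021"])
print(labels)
```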


<a id='loading'></a>
# 3. Data Loading & Preparation

Test functions for loading survey data and Stata (.dta) files with automatic preprocessing.

In [6]:
import importlib
from sociopathit.data import loading as loading_module
importlib.reload(loading_module)
from sociopathit.data.loading import load_stata, load_all_surveys

print("Loading module loaded successfully")

Loading module loaded successfully


In [7]:
# Test 4: load_all_surveys
print("=" * 60)
print("TEST 4: Load All Surveys")
print("=" * 60)

with TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)
    
    # Create test data files
    for year in [2020, 2021, 2022]:
        df_test = pd.DataFrame({
            'PID': range(1, 101),
            'Age': np.random.randint(18, 65, 100),
            'Income': np.random.randint(30000, 100000, 100),
            'Job Satisfaction': np.random.choice(['Low', 'Medium', 'High'], 100),
            'Extra_Col': np.random.random(100)
        })
        df_test.to_csv(tmpdir / f"survey_{year}.csv", index=False)
    
    # Load all surveys
    surveys = load_all_surveys(
        tmpdir,
        file_extensions=['.csv'],
        target_vars=['pid', 'age', 'income', 'job_satisfaction']
    )
    
    print(f"\nLoaded {len(surveys)} surveys:")
    for survey_id, df in surveys.items():
        print(f"\n{survey_id}:")
        print(f"  Shape: {df.shape}")
        print(f"  Columns: {list(df.columns)}")
        print(f"  Source: {df.attrs.get('source_file', 'N/A')}")
        print(f"  Sample data:\n{df.head(3)}")

TEST 4: Load All Surveys
✓ Successfully loaded 3 surveys

Loaded 3 surveys:

survey_2020:
  Shape: (100, 4)
  Columns: ['pid', 'age', 'income', 'job_satisfaction']
  Source: C:\Users\alecw\AppData\Local\Temp\tmp5hfc_w_1\survey_2020.csv
  Sample data:
   pid  age  income job_satisfaction
0    1   34   49537             High
1    2   25   40385           Medium
2    3   44   74355           Medium

survey_2021:
  Shape: (100, 4)
  Columns: ['pid', 'age', 'income', 'job_satisfaction']
  Source: C:\Users\alecw\AppData\Local\Temp\tmp5hfc_w_1\survey_2021.csv
  Sample data:
   pid  age  income job_satisfaction
0    1   41   90622           Medium
1    2   31   98930              Low
2    3   20   83911              Low

survey_2022:
  Shape: (100, 4)
  Columns: ['pid', 'age', 'income', 'job_satisfaction']
  Source: C:\Users\alecw\AppData\Local\Temp\tmp5hfc_w_1\survey_2022.csv
  Sample data:
   pid  age  income job_satisfaction
0    1   18   84197           Medium
1    2   41   86308        
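
The columns returned in Test 4 (`job_satisfaction` from `Job Satisfaction`, `pid` from `PID`) suggest that names are normalized before `target_vars` is applied. The following is a sketch of one way to do that in plain pandas; `normalize_name` is a hypothetical helper, and the exact rules used by `load_all_surveys` are an assumption.

```python
import re

import pandas as pd


def normalize_name(name):
    """Lowercase a column name and turn runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


raw = pd.DataFrame(
    {"PID": [1, 2], "Job Satisfaction": ["High", "Low"], "Extra_Col": [0.1, 0.2]}
)
renamed = raw.rename(columns=normalize_name)

# Keep only the requested variables that are present after renaming
target_vars = ["pid", "job_satisfaction"]
selected = renamed[[c for c in target_vars if c in renamed.columns]]
print(list(selected.columns))
```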

<a id='metadata'></a>
# 4. Metadata & Structure Utilities

Test functions for extracting metadata, wave information, and survey structure analysis.

In [8]:
import importlib
from sociopathit.data import metadata as metadata_module
importlib.reload(metadata_module)
from sociopathit.data.metadata import summarize_surveys, extract_wave_info, get_id_vars

print("Metadata module loaded successfully")

Metadata module loaded successfully


In [9]:
# Test 5: summarize_surveys
print("=" * 60)
print("TEST 5: Summarize Surveys")
print("=" * 60)

# Create test data
test_surveys = {
    'survey_2020': pd.DataFrame({
        'pid': range(1, 1001),
        'age': np.random.randint(18, 65, 1000),
        'income': np.random.randint(30000, 100000, 1000),
        'jobsat': np.random.randint(1, 8, 1000),
        'satguess': np.random.randint(1, 8, 1000)
    }),
    'survey_2021': pd.DataFrame({
        'pid': range(1, 1201),
        'age': np.random.randint(18, 65, 1200),
        'income': np.random.randint(30000, 100000, 1200),
        'jobsat': np.random.randint(1, 8, 1200),
        'perceivedinequality': np.random.randint(1, 6, 1200)
    }),
}

# Add ID column to attrs
test_surveys['survey_2020'].attrs['id_column'] = 'pid'
test_surveys['survey_2021'].attrs['id_column'] = 'pid'

summary = summarize_surveys(
    test_surveys,
    key_constructs=['age', 'income', 'jobsat', 'satguess', 'perceivedinequality']
)

print("\nSurvey Summary:")
print(summary.to_string(index=False))

TEST 5: Summarize Surveys

Survey Summary:
  survey_id    N  n_vars id_column  has_age  has_income  has_jobsat  has_satguess  has_perceivedinequality
survey_2020 1000       5       pid     True        True        True          True                    False
survey_2021 1200       5       pid     True        True        True         False                     True
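
A summary table like the one above can be built by checking each construct against each survey's columns. This is a sketch only; `summarize` is a hypothetical stand-in and omits the `id_column` field that `summarize_surveys` reports.

```python
import pandas as pd


def summarize(surveys, key_constructs):
    """Hypothetical helper: one row per survey with size and construct flags."""
    rows = []
    for survey_id, df in surveys.items():
        row = {"survey_id": survey_id, "N": len(df), "n_vars": df.shape[1]}
        for construct in key_constructs:
            row[f"has_{construct}"] = construct in df.columns
        rows.append(row)
    return pd.DataFrame(rows)


surveys = {
    "survey_2020": pd.DataFrame({"pid": [1, 2], "jobsat": [3, 5]}),
    "survey_2021": pd.DataFrame({"pid": [1, 2, 3], "age": [30, 41, 25]}),
}
table = summarize(surveys, key_constructs=["age", "jobsat"])
print(table.to_string(index=False))
```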


In [10]:
# Test 6: extract_wave_info
print("=" * 60)
print("TEST 6: Extract Wave Information")
print("=" * 60)

survey_ids = [
    'qwels_jan_2020',
    'qwels_jun_2020', 
    'qwels_jan_2021',
    'qwels_dec_2021',
    'qwels_mar_2022'
]

wave_info = extract_wave_info(survey_ids)

print("\nExtracted Wave Information:")
print(wave_info.to_string(index=False))

TEST 6: Extract Wave Information

Extracted Wave Information:
     survey_id  year month date wave_number wave_id
qwels_jan_2020  2020   jan None        None  wave_1
qwels_jun_2020  2020   jun None        None  wave_2
qwels_jan_2021  2021   jan None        None  wave_3
qwels_dec_2021  2021   dec None        None  wave_4
qwels_mar_2022  2022   mar None        None  wave_5
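
Ordering waves by name requires parsing both the year and the month, because month abbreviations do not sort chronologically as strings (`dec` precedes `jan` alphabetically). The sketch below shows one way to build a chronological sort key; `parse_wave` and the three-letter `MONTHS` list are assumptions for illustration, not the package's parser.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_RE = re.compile(r"(?<![a-z])(" + "|".join(MONTHS) + r")(?![a-z])")


def parse_wave(survey_id):
    """Return (year, month_number) parsed from an ID; missing parts become 0."""
    lowered = survey_id.lower()
    year = re.search(r"(?<!\d)(\d{4})(?!\d)", lowered)
    month = MONTH_RE.search(lowered)
    return (
        int(year.group(1)) if year else 0,
        MONTHS.index(month.group(1)) + 1 if month else 0,
    )


ids = ["qwels_dec_2021", "qwels_jan_2020", "qwels_jan_2021", "qwels_jun_2020"]
ordered = sorted(ids, key=parse_wave)
print(ordered)
```

Sorting on the `(year, month)` tuple keeps the order stable regardless of the order in which IDs are supplied.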


In [11]:
# Test 7: get_id_vars
print("=" * 60)
print("TEST 7: Identify ID Variables")
print("=" * 60)

# Test with explicit ID column
df_test1 = pd.DataFrame({
    'pid': range(1, 101),
    'age': np.random.randint(18, 65, 100),
    'income': np.random.randint(30000, 100000, 100)
})

id_col = get_id_vars(df_test1)
print(f"\nTest 1 - Explicit ID column:")
print(f"  Identified ID: {id_col}")

# Test with first column as ID
df_test2 = pd.DataFrame({
    'respondent_number': range(1, 101),
    'age': np.random.randint(18, 65, 100),
    'income': np.random.randint(30000, 100000, 100)
})

id_col = get_id_vars(df_test2)
print(f"\nTest 2 - First column as ID:")
print(f"  Identified ID: {id_col}")

# Test return_all
all_ids = get_id_vars(df_test1, return_all=True)
print(f"\nTest 3 - All potential IDs:")
print(f"  All IDs: {all_ids}")

TEST 7: Identify ID Variables

Test 1 - Explicit ID column:
  Identified ID: pid

Test 2 - First column as ID:
  Identified ID: respondent_number

Test 3 - All potential IDs:
  All IDs: ['pid']
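
The two cases above (a recognized name such as `pid`, then a fallback to the first column) can be expressed as a simple heuristic. This is a guess at the general approach, not the logic of `get_id_vars`; `guess_id_column` and the `COMMON_ID_NAMES` list are hypothetical.

```python
import pandas as pd

COMMON_ID_NAMES = ("pid", "id", "respondent_id")  # assumed list, for illustration


def guess_id_column(df):
    """Hypothetical heuristic: known ID name first, else the first column."""
    lowered = {col.lower(): col for col in df.columns}
    for name in COMMON_ID_NAMES:
        if name in lowered:
            return lowered[name]
    return df.columns[0]


explicit = pd.DataFrame({"age": [30, 41], "PID": [1, 2]})
fallback = pd.DataFrame({"respondent_number": [1, 2], "age": [30, 41]})
print(guess_id_column(explicit))
print(guess_id_column(fallback))
```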


<a id='longitudinal'></a>
# 5. Longitudinal Data Management

Test functions for detecting panel structure, aligning waves, and managing longitudinal datasets.

In [12]:
import importlib
from sociopathit.data import longitudinal as longitudinal_module
importlib.reload(longitudinal_module)
from sociopathit.data.longitudinal import detect_longitudinal, align_longitudinal_data, sort_by_wave

print("Longitudinal module loaded successfully")

Longitudinal module loaded successfully


In [13]:
# Test 8: detect_longitudinal
print("=" * 60)
print("TEST 8: Detect Longitudinal Structure")
print("=" * 60)

# Create panel data (same individuals across waves)
np.random.seed(42)
shared_pids = list(range(1, 51))  # 50 individuals
new_pids_wave2 = list(range(51, 61))  # 10 new in wave 2
new_pids_wave3 = list(range(61, 71))  # 10 new in wave 3

panel_data = {
    'wave1': pd.DataFrame({
        'pid': shared_pids,
        'age': np.random.randint(18, 65, len(shared_pids)),
        'income': np.random.randint(30000, 100000, len(shared_pids))
    }),
    'wave2': pd.DataFrame({
        'pid': shared_pids + new_pids_wave2,
        'age': np.random.randint(18, 65, len(shared_pids + new_pids_wave2)),
        'income': np.random.randint(30000, 100000, len(shared_pids + new_pids_wave2))
    }),
    'wave3': pd.DataFrame({
        'pid': shared_pids + new_pids_wave3,
        'age': np.random.randint(18, 65, len(shared_pids + new_pids_wave3)),
        'income': np.random.randint(30000, 100000, len(shared_pids + new_pids_wave3))
    })
}

result = detect_longitudinal(panel_data, id_col='pid')

print(f"\nIs longitudinal: {result['is_longitudinal']}")
print(f"ID column: {result['id_column']}")
print(f"Number of shared IDs: {len(result['shared_ids'])}")
print(f"\nOverlap Matrix:")
print(result['overlap_matrix'])

TEST 8: Detect Longitudinal Structure

Is longitudinal: True
ID column: pid
Number of shared IDs: 50

Overlap Matrix:
       wave1  wave2  wave3
wave1   50.0   50.0   50.0
wave2   50.0   60.0   50.0
wave3   50.0   50.0   60.0
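
Each cell of the overlap matrix is the number of IDs two waves have in common, and the diagonal is the size of each wave. A minimal sketch of that computation with set intersections, using the same wave sizes as Test 8; `overlap_matrix` is a hypothetical helper and returns integer counts rather than the floats printed above.

```python
import pandas as pd


def overlap_matrix(waves, id_col="pid"):
    """Count IDs shared by every pair of waves (diagonal = wave size)."""
    ids = {name: set(df[id_col]) for name, df in waves.items()}
    names = list(ids)
    matrix = pd.DataFrame(0, index=names, columns=names)
    for a in names:
        for b in names:
            matrix.loc[a, b] = len(ids[a] & ids[b])
    return matrix


waves = {
    "wave1": pd.DataFrame({"pid": range(1, 51)}),
    "wave2": pd.DataFrame({"pid": range(1, 61)}),
    "wave3": pd.DataFrame({"pid": list(range(1, 51)) + list(range(61, 71))}),
}
matrix = overlap_matrix(waves)
# IDs present in every wave form the balanced panel
shared_by_all = set.intersection(*(set(df["pid"]) for df in waves.values()))
print(matrix)
print(len(shared_by_all))
```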


In [14]:
# Test 9: align_longitudinal_data
print("=" * 60)
print("TEST 9: Align Longitudinal Data")
print("=" * 60)

# Use panel data from previous test
wave_labels = {
    'wave1': 'wave_1',
    'wave2': 'wave_2',
    'wave3': 'wave_3'
}

long_df = align_longitudinal_data(
    panel_data,
    id_col='pid',
    wave_labels=wave_labels,
    align_vars=['age', 'income']
)

print(f"\nLong-form dataset shape: {long_df.shape}")
print(f"Columns: {list(long_df.columns)}")
print(f"\nSample data (first individual across waves):")
print(long_df[long_df['pid'] == 1])

TEST 9: Align Longitudinal Data

Long-form dataset shape: (170, 5)
Columns: ['pid', 'age', 'income', 'wave', 'source']

Sample data (first individual across waves):
   pid  age  income    wave source
0    1   56   53483  wave_1  wave1
1    1   29   51976  wave_2  wave2
2    1   45   77333  wave_3  wave3
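
The 170 rows reported above are the sum of the three wave sizes (50 + 60 + 60): long-form alignment stacks the waves and tags each row with its wave. A sketch of that stacking in plain pandas, assuming the same `wave` and `source` column names as the output; it is not the implementation of `align_longitudinal_data`.

```python
import pandas as pd

waves = {
    "wave1": pd.DataFrame({"pid": [1, 2], "age": [30, 41], "extra": [0, 1]}),
    "wave2": pd.DataFrame({"pid": [1, 2, 3], "age": [31, 42, 25]}),
}
wave_labels = {"wave1": "wave_1", "wave2": "wave_2"}
align_vars = ["age"]

frames = []
for source, df in waves.items():
    # Keep the ID plus the aligned variables, then tag each row with its wave
    part = df[["pid", *align_vars]].copy()
    part["wave"] = wave_labels[source]
    part["source"] = source
    frames.append(part)

long_df = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["pid", "wave"])
    .reset_index(drop=True)
)
print(long_df)
```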


In [15]:
# Test 10: sort_by_wave
print("=" * 60)
print("TEST 10: Sort by Wave")
print("=" * 60)

# Create unsorted long data
unsorted_df = long_df.sample(frac=1, random_state=42)  # Shuffle

print("Before sorting:")
print(unsorted_df.head(10))

sorted_df = sort_by_wave(unsorted_df, id_col='pid', wave_col='wave')

print("\nAfter sorting:")
print(sorted_df.head(10))

TEST 10: Sort by Wave
Before sorting:
     pid  age  income    wave source
139   47   39   72107  wave_2  wave2
30    11   28   39692  wave_1  wave1
119   40   35   34014  wave_3  wave3
29    10   43   35600  wave_3  wave3
144   49   19   38392  wave_1  wave1
163   64   20   59592  wave_3  wave3
166   67   29   32368  wave_3  wave3
51    18   19   31016  wave_1  wave1
105   36   61   78190  wave_1  wave1
60    21   47   54300  wave_1  wave1

After sorting:
   pid  age  income    wave source
0    1   56   53483  wave_1  wave1
1    1   29   51976  wave_2  wave2
2    1   45   77333  wave_3  wave3
3    2   46   78555  wave_1  wave1
4    2   51   74262  wave_2  wave2
5    2   42   33436  wave_3  wave3
6    3   32   47159  wave_1  wave1
7    3   50   53776  wave_2  wave2
8    3   40   35895  wave_3  wave3
9    4   60   65920  wave_1  wave1
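
With string labels such as `wave_1`, a plain lexical sort works for up to nine waves but would place `wave_10` before `wave_2`. Whether `sort_by_wave` handles this case is not shown in this notebook; the sketch below illustrates one way to sort on the numeric suffix instead, assuming labels end in an integer.

```python
import pandas as pd

df = pd.DataFrame({
    "pid": [1, 1, 1, 2],
    "wave": ["wave_10", "wave_2", "wave_1", "wave_1"],
})

# Extract the trailing integer so that wave_10 sorts after wave_2
wave_num = df["wave"].str.extract(r"(\d+)$", expand=False).astype(int)
sorted_df = (
    df.assign(_wave_num=wave_num)
    .sort_values(["pid", "_wave_num"])
    .drop(columns="_wave_num")
    .reset_index(drop=True)
)
print(sorted_df)
```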


<a id='preparation'></a>
# 6. Harmonization & Pre-Analysis Preparation

Test functions for harmonizing variables, building datasets, and preparing data for statistical analysis.

In [16]:
import importlib
from sociopathit.data import preparation as preparation_module
importlib.reload(preparation_module)
from sociopathit.data.preparation import (
    harmonize_columns,
    build_harmonized_dataset,
    to_categorical_ordered,
    numeric_codes,
    prepare_for_analysis
)

print("Preparation module loaded successfully")

Preparation module loaded successfully


In [17]:
# Test 11: harmonize_columns
print("=" * 60)
print("TEST 11: Harmonize Columns")
print("=" * 60)

# Create test data with inconsistent labels
df_inconsistent = pd.DataFrame({
    'pid': range(1, 101),
    'satisfaction': np.random.choice(
        ['very satisfied', 'satisfied', 'neutral', 'dissatisfied', 'very dissatisfied'],
        100
    ),
    'agreement': np.random.choice(
        ['strongly agree', 'agree', 'neutral', 'disagree', 'strongly disagree', 'dk'],
        100
    )
})

print("Before harmonization:")
print(f"Satisfaction values: {df_inconsistent['satisfaction'].unique()}")
print(f"Agreement values: {df_inconsistent['agreement'].unique()}")

# Define mappings
mappings = {
    'satisfaction': {
        'very satisfied': 'Very satisfied',
        'satisfied': 'Satisfied',
        'neutral': 'Neutral',
        'dissatisfied': 'Dissatisfied',
        'very dissatisfied': 'Very dissatisfied'
    },
    'agreement': {
        'strongly agree': 'Strongly agree',
        'agree': 'Agree',
        'neutral': 'Neutral',
        'disagree': 'Disagree',
        'strongly disagree': 'Strongly disagree'
    }
}

df_harmonized = harmonize_columns(df_inconsistent, mappings, handle_missing='na')

print("\nAfter harmonization:")
print(f"Satisfaction values: {df_harmonized['satisfaction'].unique()}")
print(f"Agreement values: {df_harmonized['agreement'].dropna().unique()}")
print(f"\nShape before: {df_inconsistent.shape}, after: {df_harmonized.shape}")

TEST 11: Harmonize Columns
Before harmonization:
Satisfaction values: ['satisfied' 'neutral' 'very satisfied' 'very dissatisfied' 'dissatisfied']
Agreement values: ['strongly disagree' 'disagree' 'neutral' 'strongly agree' 'agree' 'dk']

After harmonization:
Satisfaction values: ['Satisfied' 'Neutral' 'Very satisfied' 'Very dissatisfied' 'Dissatisfied']
Agreement values: ['Strongly disagree' 'Disagree' 'Neutral' 'Strongly agree' 'Agree']

Shape before: (100, 3), after: (100, 3)
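
In Test 11 the `dk` response has no entry in the `agreement` mapping and no longer appears after harmonization. In plain pandas the same effect falls out of `Series.map` with a dict, which returns `NaN` for unmapped values. This sketch assumes that `handle_missing='na'` behaves similarly; the source does not confirm it.

```python
import pandas as pd

responses = pd.Series(["agree", "dk", "strongly agree", "neutral"])
mapping = {"agree": "Agree", "strongly agree": "Strongly agree", "neutral": "Neutral"}

# Series.map returns NaN for any value missing from the dict,
# so an unmapped code such as 'dk' becomes missing
harmonized = responses.map(mapping)
print(harmonized.tolist())
print(int(harmonized.isna().sum()))
```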


In [18]:
# Test 12: build_harmonized_dataset
print("=" * 60)
print("TEST 12: Build Harmonized Dataset")
print("=" * 60)

# Create multi-wave data
wave_data = {
    'survey_2020': pd.DataFrame({
        'pid': range(1, 101),
        'age': np.random.randint(18, 65, 100),
        'income': np.random.randint(30000, 100000, 100),
        'satisfaction': np.random.randint(1, 8, 100),
        'extra_2020': np.random.random(100)
    }),
    'survey_2021': pd.DataFrame({
        'pid': range(1, 101),
        'age': np.random.randint(18, 65, 100),
        'income': np.random.randint(30000, 100000, 100),
        'satisfaction': np.random.randint(1, 8, 100),
        'extra_2021': np.random.random(100)
    }),
    'survey_2022': pd.DataFrame({
        'pid': range(1, 101),
        'age': np.random.randint(18, 65, 100),
        'income': np.random.randint(30000, 100000, 100),
        'satisfaction': np.random.randint(1, 8, 100),
        'extra_2022': np.random.random(100)
    })
}

harmonized = build_harmonized_dataset(
    wave_data,
    target_vars=['age', 'income', 'satisfaction'],
    id_col='pid',
    add_identifiers=True
)

print(f"\nHarmonized dataset shape: {harmonized.shape}")
print(f"Columns: {list(harmonized.columns)}")
print(f"\nUnique surveys: {harmonized['surveyid'].unique()}")
print(f"Wave labels: {harmonized['wave'].unique()}")
print(f"\nSample data:")
print(harmonized.head(10))

TEST 12: Build Harmonized Dataset

Harmonized dataset shape: (300, 7)
Columns: ['pid', 'age', 'income', 'satisfaction', 'surveyid', 'year', 'wave']

Unique surveys: ['survey_2020' 'survey_2021' 'survey_2022']
Wave labels: ['wave_1' 'wave_2' 'wave_3']

Sample data:
   pid  age  income  satisfaction     surveyid  year    wave
0    1   60   57355             4  survey_2020  2020  wave_1
1    2   28   34835             5  survey_2020  2020  wave_1
2    3   35   50159             2  survey_2020  2020  wave_1
3    4   64   77605             6  survey_2020  2020  wave_1
4    5   29   68088             5  survey_2020  2020  wave_1
5    6   26   85284             2  survey_2020  2020  wave_1
6    7   27   87043             5  survey_2020  2020  wave_1
7    8   61   65547             6  survey_2020  2020  wave_1
8    9   34   57532             3  survey_2020  2020  wave_1
9   10   55   64349             3  survey_2020  2020  wave_1


In [19]:
# Test 13: to_categorical_ordered
print("=" * 60)
print("TEST 13: Convert to Ordered Categorical")
print("=" * 60)

df_test = pd.DataFrame({
    'response': ['Agree', 'Disagree', 'Strongly agree', 'Neutral', 'Strongly disagree'] * 20
})

order = ['Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree']

print("Before conversion:")
print(f"  Type: {df_test['response'].dtype}")
print(f"  Sample: {df_test['response'].head()}")

df_ordered = to_categorical_ordered(df_test, 'response', order)

print("\nAfter conversion:")
print(f"  Type: {df_ordered['response'].dtype}")
print(f"  Ordered: {df_ordered['response'].dtype.ordered}")
print(f"  Categories: {list(df_ordered['response'].cat.categories)}")

TEST 13: Convert to Ordered Categorical
Before conversion:
  Type: object
  Sample: 0                Agree
1             Disagree
2       Strongly agree
3              Neutral
4    Strongly disagree
Name: response, dtype: object

After conversion:
  Type: category
  Ordered: True
  Categories: ['Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree']


In [20]:
# Test 14: numeric_codes
print("=" * 60)
print("TEST 14: Convert to Numeric Codes")
print("=" * 60)

# Use ordered categorical from previous test
numeric = numeric_codes(df_ordered['response'], start=1)

print("Original vs. Numeric:")
comparison = pd.DataFrame({
    'original': df_ordered['response'].head(10),
    'numeric': numeric.head(10)
})
print(comparison)

TEST 14: Convert to Numeric Codes
Original vs. Numeric:
            original  numeric
0              Agree      4.0
1           Disagree      2.0
2     Strongly agree      5.0
3            Neutral      3.0
4  Strongly disagree      1.0
5              Agree      4.0
6           Disagree      2.0
7     Strongly agree      5.0
8            Neutral      3.0
9  Strongly disagree      1.0
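
The float values in the output above (`4.0` rather than `4`) are consistent with missing responses being kept as `NaN`, although the source does not show how `numeric_codes` is implemented. A sketch of the equivalent steps in plain pandas, assuming 1-based codes that follow the category order:

```python
import pandas as pd

order = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
responses = pd.Series(["Agree", "Disagree", None, "Strongly agree"])

ordered = responses.astype(pd.CategoricalDtype(categories=order, ordered=True))

# .cat.codes is 0-based and uses -1 for missing values
codes = ordered.cat.codes
# Mask the -1 sentinel (becomes NaN), then shift to 1-based codes
numeric = codes.where(codes >= 0).add(1)
print(numeric.tolist())
```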


In [21]:
# Test 15: prepare_for_analysis
print("=" * 60)
print("TEST 15: Prepare for Analysis")
print("=" * 60)

# Create messy data
df_messy = pd.DataFrame({
    'PID': range(1, 201),
    'Age': np.concatenate([np.random.randint(18, 65, 190), [-99] * 10]),  # Missing codes
    'Income': np.concatenate([np.random.randint(30000, 100000, 180), [-99] * 20]),
    'JobSat': np.random.choice(['Very satisfied', 'Satisfied', 'Neutral', 'Dissatisfied'], 200),
    'Weight': np.random.uniform(0.5, 1.5, 200)
})

# Add some rows with lots of missing data
df_messy.loc[195:199, ['Age', 'Income', 'JobSat']] = np.nan

print("Before preparation:")
print(f"  Shape: {df_messy.shape}")
print(f"  Columns: {list(df_messy.columns)}")
print(f"  Missing -99 in Age: {(df_messy['Age'] == -99).sum()}")
print(f"  Weight sum: {df_messy['Weight'].sum():.2f}")

df_clean = prepare_for_analysis(
    df_messy,
    satisfaction_vars=['jobsat'],
    weight_col='weight',
    normalize_weights=True,
    min_valid_pct=0.6
)

print("\nAfter preparation:")
print(f"  Shape: {df_clean.shape}")
print(f"  Columns: {list(df_clean.columns)}")
print(f"  Missing -99 in age: {(df_clean['age'] == -99).sum()}")
print(f"  Weight sum: {df_clean['weight'].sum():.2f}")
print(f"  JobSat type: {df_clean['jobsat'].dtype}")

TEST 15: Prepare for Analysis
Before preparation:
  Shape: (200, 5)
  Columns: ['PID', 'Age', 'Income', 'JobSat', 'Weight']
  Missing -99 in Age: 5
  Weight sum: 200.17
Removed 5 rows with >40% missing data

After preparation:
  Shape: (195, 5)
  Columns: ['pid', 'age', 'income', 'jobsat', 'weight']
  Missing -99 in age: 0
  Weight sum: 195.21
  JobSat type: category
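
The preparation steps exercised in Test 15 (lowercasing names, recoding `-99` as missing, dropping sparse rows, normalizing weights) can be sketched in plain pandas. The order of the steps here is an assumption: normalizing after filtering makes the weights sum exactly to the number of remaining rows, whereas the output above (195.21 for 195 rows) suggests the package may apply them in a different order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, -99, 40, np.nan],
    "Income": [50000, 62000, -99, np.nan],
    "Weight": [0.5, 1.5, 2.0, 2.0],
})

clean = df.rename(columns=str.lower)
clean = clean.replace(-99, np.nan)  # treat -99 as a missing-value code

# Keep rows where at least 60% of the columns are non-missing
min_valid_pct = 0.6
clean = clean[clean.notna().mean(axis=1) >= min_valid_pct].copy()

# Rescale weights so they sum to the number of remaining rows
clean["weight"] = clean["weight"] * len(clean) / clean["weight"].sum()
print(clean)
```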


# 7. Enhanced Dynamic Data Functions

Test new functions for dynamic data loading, ID normalization, and automated longitudinal dataset construction.

In [22]:
# Test 16: normalize_ids - ID normalization with .0 removal
print("=" * 60)
print("TEST 16: Normalize IDs")
print("=" * 60)

from sociopathit.data.loading import normalize_ids

# Create test data with problematic IDs
df_messy_ids = pd.DataFrame({
    'id': [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0],
    'age': np.random.randint(18, 65, 10),
    'value': np.random.random(10)
})

print("Before normalization:")
print(f"  ID dtype: {df_messy_ids['id'].dtype}")
print(f"  Sample IDs: {df_messy_ids['id'].head().tolist()}")
print(f"  Missing IDs: {df_messy_ids['id'].isna().sum()}")

# Normalize with default settings
df_normalized = normalize_ids(df_messy_ids, id_cols='id', handle_missing='keep')

print("\nAfter normalization (keep missing):")
print(f"  ID dtype: {df_normalized['id'].dtype}")
print(f"  Sample IDs: {df_normalized['id'].head().tolist()}")
print(f"  Missing IDs: {df_normalized['id'].isna().sum()}")

# Test with filling missing IDs
df_normalized_fill = normalize_ids(df_messy_ids, id_cols='id', handle_missing='fill')

print("\nAfter normalization (fill missing):")
print(f"  Missing IDs: {df_normalized_fill['id'].isna().sum()}")
print(f"  All IDs: {sorted(df_normalized_fill['id'].tolist())}")

# Test with string IDs containing .0
df_string_ids = pd.DataFrame({
    'id': ['123.0', '456.0', '789.0', '1011.0'],
    'value': [1, 2, 3, 4]
})

print("\nString IDs with .0 suffix:")
print(f"  Before: {df_string_ids['id'].tolist()}")
df_string_norm = normalize_ids(df_string_ids, id_cols='id')
print(f"  After: {df_string_norm['id'].tolist()}")
print(f"  After dtype: {df_string_norm['id'].dtype}")

TEST 16: Normalize IDs
Before normalization:
  ID dtype: float64
  Sample IDs: [1.0, 2.0, 3.0, 4.0, 5.0]
  Missing IDs: 1

After normalization (keep missing):
  ID dtype: Int64
  Sample IDs: [1, 2, 3, 4, 5]
  Missing IDs: 1

After normalization (fill missing):
  Missing IDs: 0
  All IDs: [1, 2, 3, 4, 5, 7, 8, 9, 10, 11]

String IDs with .0 suffix:
  Before: ['123.0', '456.0', '789.0', '1011.0']
  After: [123, 456, 789, 1011]
  After dtype: Int64
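
IDs read as floats (`1.0`) or as strings (`'123.0'`) will not match integer IDs in a merge, which is why normalizing to a single integer type matters. The sketch below shows one way to reach the nullable `Int64` dtype seen above; `normalize_id_series` is a hypothetical helper and does not reproduce the `handle_missing` options of `normalize_ids`.

```python
import pandas as pd


def normalize_id_series(ids):
    """Hypothetical helper: coerce float or '123.0'-style IDs to nullable Int64."""
    numeric = pd.to_numeric(ids, errors="coerce")  # non-numeric IDs become NaN
    return numeric.astype("Int64")  # nullable integer keeps missing values as <NA>


float_ids = normalize_id_series(pd.Series([1.0, 2.0, None]))
string_ids = normalize_id_series(pd.Series(["123.0", "456.0"]))
print(float_ids.tolist())
print(string_ids.tolist())
```

The nullable `Int64` dtype is used here rather than NumPy's `int64` because the latter cannot represent missing IDs.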


In [23]:
# Test 17: load_and_combine - Multiple combine methods
print("=" * 60)
print("TEST 17: Load and Combine Datasets")
print("=" * 60)

from sociopathit.data.loading import load_and_combine

with TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)
    
    # Create one CSV dataset per year, with float-formatted IDs
    for year in [2020, 2021, 2022]:
        df_test = pd.DataFrame({
            'pid': [i + 0.0 for i in range(1, 51)],  # IDs with .0
            'age': np.random.randint(18, 65, 50),
            'income': np.random.randint(30000, 100000, 50),
            'satisfaction': np.random.randint(1, 8, 50)
        })
        df_test.to_csv(tmpdir / f"survey_{year}.csv", index=False)
    
    # Also create a test file to exclude
    df_test.to_csv(tmpdir / "survey_2023_test.csv", index=False)
    
    print("Test A: Separate method (returns dict)")
    print("-" * 60)
    data_separate = load_and_combine(
        tmpdir,
        file_types=['csv'],
        exclude_patterns=['*_test*'],
        combine_method='separate',
        normalize_id=True
    )
    
    print(f"  Loaded {len(data_separate)} datasets:")
    for survey_id, df in data_separate.items():
        print(f"    - {survey_id}: {df.shape}, ID dtype: {df['pid'].dtype}")
    
    print("\nTest B: Concat method (long-form)")
    print("-" * 60)
    data_concat = load_and_combine(
        tmpdir,
        file_types=['csv'],
        exclude_patterns=['*_test*'],
        target_vars=['pid', 'age', 'income', 'satisfaction'],
        combine_method='concat',
        normalize_id=True
    )
    
    print(f"  Combined shape: {data_concat.shape}")
    print(f"  Columns: {list(data_concat.columns)}")
    print(f"  Unique sources: {data_concat['source'].unique()}")
    print(f"  Unique waves: {sorted(data_concat['wave'].dropna().unique())}")
    
    print("\nTest C: Merge method (wide-form)")
    print("-" * 60)
    data_merge = load_and_combine(
        tmpdir,
        file_types=['csv'],
        exclude_patterns=['*_test*'],
        target_vars=['age', 'income'],
        combine_method='merge',
        id_col='pid',
        normalize_id=True
    )
    
    print(f"  Merged shape: {data_merge.shape}")
    print(f"  Columns: {list(data_merge.columns)}")
    print(f"  Sample data:")
    print(data_merge.head(3))

TEST 17: Load and Combine Datasets
Test A: Separate method (returns dict)
------------------------------------------------------------
✓ Loaded 3 datasets
  File types: csv=3
⚠ Skipped 1 files
  - survey_2023_test.csv: matched exclusion pattern
  Loaded 3 datasets:
    - survey_2020: (50, 4), ID dtype: Int64
    - survey_2021: (50, 4), ID dtype: Int64
    - survey_2022: (50, 4), ID dtype: Int64

Test B: Concat method (long-form)
------------------------------------------------------------
✓ Loaded 3 datasets
  File types: csv=3
⚠ Skipped 1 files
  - survey_2023_test.csv: matched exclusion pattern
  Combined shape: (150, 6)
  Columns: ['pid', 'age', 'income', 'satisfaction', 'source', 'wave']
  Unique sources: ['survey_2020' 'survey_2021' 'survey_2022']
  Unique waves: [np.int64(2020), np.int64(2021), np.int64(2022)]

Test C: Merge method (wide-form)
------------------------------------------------------------
✓ Loaded 3 datasets
  File types: csv=3
⚠ Skipped 1 files
  - sur

In [24]:
# Test 18: extract_wave_info with directory path
print("=" * 60)
print("TEST 18: Extract Wave Info from Directory")
print("=" * 60)

from sociopathit.data.metadata import extract_wave_info

with TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)
    
    # Create test data files with date information in names
    for year, month in [(2020, 'jan'), (2020, 'jun'), (2021, 'jan'), (2021, 'dec'), (2022, 'mar')]:
        df_test = pd.DataFrame({
            'pid': range(1, 51),
            'value': np.random.random(50)
        })
        df_test.to_csv(tmpdir / f"qwels_{month}_{year}.csv", index=False)
    
    print("Test A: Extract from directory path")
    print("-" * 60)
    wave_info = extract_wave_info(tmpdir)
    
    print(f"\nExtracted wave information from {len(wave_info)} files:")
    print(wave_info.to_string(index=False))
    
    print("\nTest B: Extract from dict of DataFrames")
    print("-" * 60)
    data_dict = {
        'survey_wave_1': pd.DataFrame({'pid': range(1, 51)}),
        'survey_wave_2': pd.DataFrame({'pid': range(1, 51)}),
        'survey_wave_3': pd.DataFrame({'pid': range(1, 51)}),
    }
    
    wave_info_dict = extract_wave_info(data_dict)
    print(wave_info_dict.to_string(index=False))
    
    print("\nTest C: Extract from list of IDs")
    print("-" * 60)
    survey_ids = ['study_2020', 'study_2021', 'study_2022']
    wave_info_list = extract_wave_info(survey_ids)
    print(wave_info_list.to_string(index=False))

TEST 18: Extract Wave Info from Directory
Test A: Extract from directory path
------------------------------------------------------------

Extracted wave information from 5 files:
     survey_id  year month date wave_number wave_id
qwels_jan_2020  2020   jan None        None  wave_1
qwels_jun_2020  2020   jun None        None  wave_2
qwels_dec_2021  2021   dec None        None  wave_3
qwels_jan_2021  2021   jan None        None  wave_4
qwels_mar_2022  2022   mar None        None  wave_5

Test B: Extract from dict of DataFrames
------------------------------------------------------------
    survey_id year month date  wave_number wave_id
survey_wave_1 None  None None            1  wave_1
survey_wave_2 None  None None            2  wave_2
survey_wave_3 None  None None            3  wave_3

Test C: Extract from list of IDs
------------------------------------------------------------
 survey_id  year month date wave_number wave_id
study_2020  2020  None None        None  wave_1
study_2021

In [25]:
# Test 19: build_longit_from_dir - Complete workflow
print("=" * 60)
print("TEST 19: Build Longitudinal Dataset from Directory")
print("=" * 60)

from sociopathit.data.longitudinal import build_longit_from_dir

with TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)
    
    # Create longitudinal data files (same individuals across waves)
    np.random.seed(42)
    shared_pids = list(range(1, 81))  # 80 core individuals
    
    for year in [2020, 2021, 2022]:
        # Add some new respondents in each wave
        if year == 2021:
            wave_pids = shared_pids + list(range(81, 91))
        elif year == 2022:
            wave_pids = shared_pids + list(range(91, 101))
        else:
            wave_pids = shared_pids
        
        df_wave = pd.DataFrame({
            'pid': [float(p) for p in wave_pids],  # IDs with .0 
            'age': np.random.randint(18, 65, len(wave_pids)),
            'income': np.random.randint(30000, 100000, len(wave_pids)),
            'satisfaction': np.random.randint(1, 8, len(wave_pids)),
            'health': np.random.randint(1, 6, len(wave_pids)),
            'extra_var': np.random.random(len(wave_pids))  # Not in target_vars, so it is dropped
        })
        df_wave.to_csv(tmpdir / f"survey_wave_{year}.csv", index=False)
    
    # Also create a file to exclude
    df_wave.to_csv(tmpdir / "backup_survey.csv", index=False)
    
    print("Building longitudinal dataset from directory...")
    print("-" * 60)
    
    # Build long-form dataset
    long_df = build_longit_from_dir(
        data_dir=tmpdir,
        target_vars=['age', 'income', 'satisfaction', 'health'],
        id_col='pid',
        file_types=['csv'],
        exclude_patterns=['backup_*'],
        normalize_ids=True,
        auto_detect_waves=True
    )
    
    print(f"\n✓ Longitudinal Dataset Created")
    print(f"  Shape: {long_df.shape}")
    print(f"  Columns: {list(long_df.columns)}")
    print(f"  Unique individuals: {long_df['pid'].nunique()}")
    print(f"  Waves: {sorted(long_df['wave'].unique())}")
    print(f"  ID dtype: {long_df['pid'].dtype}")
    
    print(f"\n  Wave distribution:")
    print(long_df['wave'].value_counts().sort_index())
    
    print(f"\n  Sample data (first individual across waves):")
    print(long_df[long_df['pid'] == 1])
    
    print(f"\n  Verify ID normalization (no .0 issues):")
    print(f"    Sample IDs: {long_df['pid'].head(10).tolist()}")

TEST 19: Build Longitudinal Dataset from Directory
Building longitudinal dataset from directory...
------------------------------------------------------------
Loading data from C:\Users\alecw\AppData\Local\Temp\tmp4cqyuax1...
✓ Loaded 3 datasets
  File types: csv=3
⚠ Skipped 1 files
  - backup_survey.csv: matched exclusion pattern
Detecting wave structure...
Using ID column: pid
Normalizing IDs...
Building long-form dataset...
✓ Created long-form dataset: 260 observations
  - 100 unique individuals
  - 3 waves
  - 4 variables

✓ Longitudinal Dataset Created
  Shape: (260, 7)
  Columns: ['pid', 'age', 'income', 'satisfaction', 'health', 'wave', 'source']
  Unique individuals: 100
  Waves: ['wave_1', 'wave_2', 'wave_3']
  ID dtype: Int64

  Wave distribution:
wave
wave_1    80
wave_2    90
wave_3    90
Name: count, dtype: int64

  Sample data (first individual across waves):
   pid  age  income  satisfaction  health    wave            source
0    1   56   53247             7    