# Pedagogical Report: Teaching Propensity Score Matching

**INFO 7390: Advanced Data Science and Architecture**

**Author:** Nikshipth Narayan Bondugula

**Date:** December 2025

**Topic:** Propensity Score Matching for Causal Inference

---

## Table of Contents

1. Teaching Philosophy
2. Concept Deep Dive
3. Implementation Analysis
4. Assessment & Effectiveness

---

## 1. Teaching Philosophy

### 1.1 Target Audience

This tutorial is designed for **graduate students in data science, statistics, or related quantitative fields** who have foundational knowledge but are new to causal inference methods.

**Assumed Background:**

- Proficiency in Python programming and the pandas library
- Understanding of logistic regression and probability concepts
- Familiarity with basic statistical inference (hypothesis testing, confidence intervals)
- Experience with data visualization using matplotlib/seaborn
- No prior exposure to causal inference or propensity score methods required

**Audience Personas:**

| Persona | Background | Goals |
|---------|------------|-------|
| Data Science Student | Strong in ML, weak in causal methods | Learn when correlation ≠ causation |
| Healthcare Analyst | Domain expertise, moderate coding | Evaluate treatment effectiveness |
| Policy Researcher | Statistics background | Assess program impact from observational data |

### 1.2 Learning Objectives

By completing this tutorial, students will be able to:

**Knowledge (Conceptual Understanding):**

1. Explain the fundamental problem of causal inference and why naive comparisons fail
2. Define propensity scores and articulate the theoretical basis for their use
3. Identify the three critical assumptions (unconfoundedness, positivity, SUTVA) and their implications
4. Distinguish between ATE and ATT and know when each is appropriate

**Skills (Practical Application):**

1. Implement a complete PSM pipeline in Python from scratch
2. Estimate propensity scores using logistic regression
3. Perform nearest neighbor matching with appropriate caliper selection
4. Assess covariate balance using SMD and Love plots
5. Calculate treatment effects with confidence intervals

**Critical Thinking (Evaluation):**

1. Evaluate when PSM is appropriate versus alternative methods
2. Identify potential unmeasured confounders in real-world scenarios
3. Apply computational skepticism to causal claims from observational studies
4. Recognize limitations and communicate uncertainty appropriately

### 1.3 Pedagogical Approach and Rationale

**Core Philosophy: "Learn by Teaching, Understand by Doing"**

This tutorial employs a constructivist learning approach where students actively build understanding through hands-on implementation rather than passive absorption of theory.

**Instructional Design Principles:**

**1. Scaffolded Learning Progression**

The tutorial follows a carefully sequenced structure:

- **Motivation First:** Begin with a compelling real-world problem (cardiac rehab effectiveness) before introducing technical concepts
- **Concept → Math → Code:** Each topic progresses from intuition to formalization to implementation
- **Progressive Complexity:** Start with simple visualizations, advance to complete pipeline implementation

**2. Multiple Representations**

Each concept is presented through multiple modalities:

- **Verbal:** Written explanations with analogies and plain language
- **Mathematical:** Formal notation and equations for precision
- **Visual:** Diagrams, flowcharts, and data visualizations
- **Computational:** Working Python code with extensive comments

This multi-modal approach accommodates diverse learning styles and reinforces understanding through repetition in different forms.

**3. Active Learning Through Exercises**

The tutorial includes three levels of exercises:

- **Beginner (Conceptual):** Identify confounders and bias direction
- **Intermediate (Implementation):** Complete a PSM pipeline on new data
- **Advanced (Critical Thinking):** Evaluate real-world study limitations

Additionally, "Try It Yourself" challenges encourage experimentation with parameters.

**4. Worked Examples with Visible Thinking**

All code includes extensive comments explaining not just "what" but "why":

- Design decisions are explicitly discussed
- Trade-offs are acknowledged
- Common mistakes are highlighted proactively

**5. Immediate Feedback Loops**

- Expected outputs are provided so students can verify correctness
- Diagnostic functions help identify problems
- The debugging guide addresses common errors

---

## 2. Concept Deep Dive

### 2.1 Technical and Mathematical Foundations

**The Fundamental Problem of Causal Inference**

At the heart of causal inference lies a fundamental impossibility: we can never observe both potential outcomes for the same individual. For any patient i, we define:

- Y_i(1): The outcome if patient i receives treatment
- Y_i(0): The outcome if patient i does not receive treatment

The individual treatment effect τ_i = Y_i(1) - Y_i(0) is inherently unobservable because each patient either receives treatment or doesn't—never both.
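
A small simulation can make the missing-data nature of this problem concrete. The sketch below uses synthetic data in which both potential outcomes are known, something that is never available in practice. The variable names and the data-generating process are assumptions chosen for illustration and are not the tutorial's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical confounder: baseline health (higher = healthier).
health = rng.normal(size=n)

# Both potential outcomes, known here only because the data are simulated.
y0 = 0.5 * health + rng.normal(size=n)  # outcome without treatment
y1 = y0 + 1.0                           # outcome with treatment (true effect = 1.0)

# Sicker patients are more likely to be treated, which creates selection bias.
p_treat = 1.0 / (1.0 + np.exp(1.5 * health))
t = rng.binomial(1, p_treat)

# In real data only one potential outcome per patient is observed.
y_obs = np.where(t == 1, y1, y0)

true_att = (y1 - y0)[t == 1].mean()
naive_diff = y_obs[t == 1].mean() - y_obs[t == 0].mean()

print(f"True ATT:         {true_att:.2f}")
print(f"Naive difference: {naive_diff:.2f}")
```

Because treated patients start out sicker in this setup, the naive difference in observed means understates the true effect even though every treated patient benefits by exactly 1.0.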

**The Propensity Score Solution**

Rosenbaum and Rubin (1983) proved a remarkable result: if treatment assignment is strongly ignorable given covariates X, it remains strongly ignorable given only the propensity score e(X) = P(T=1|X). This dimensionality reduction is powerful—instead of matching on potentially dozens of covariates, we match on a single scalar.

**Mathematical Framework:**

The propensity score is estimated via logistic regression:

```
log(e(X)/(1-e(X))) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ
```
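
As a minimal sketch of this estimation step, assuming scikit-learn's `LogisticRegression` and `StandardScaler` as listed in the implementation notes: the function signature, the covariate names, and the synthetic data below are illustrative assumptions, not the tutorial's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def estimate_propensity_scores(X, t):
    """Fit a logistic regression of treatment on covariates.

    Returns the estimated scores P(T=1|X), the fitted model, and the scaler.
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_scaled, t)
    # Column 1 of predict_proba is the probability of the class labelled 1.
    scores = model.predict_proba(X_scaled)[:, 1]
    return scores, model, scaler


# Synthetic example with two hypothetical covariates.
rng = np.random.default_rng(42)
n = 2_000
age = rng.normal(60, 10, size=n)
severity = rng.normal(0, 1, size=n)
X = np.column_stack([age, severity])
true_logit = 0.05 * (age - 60) - 0.8 * severity
t = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

scores, model, scaler = estimate_propensity_scores(X, t)
print(f"Mean score, treated: {scores[t == 1].mean():.3f}")
print(f"Mean score, control: {scores[t == 0].mean():.3f}")
```

Treated units should have a higher average estimated score than controls; if the two averages were nearly identical, the covariates would carry little information about treatment assignment.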

The key assumptions are:

1. **Unconfoundedness:** Y(0), Y(1) ⊥ T | X
   - Potential outcomes are independent of treatment given covariates
   - This is untestable and requires domain knowledge

2. **Positivity:** 0 < P(T=1|X) < 1 for all X
   - Every covariate combination has some probability of each treatment
   - Checkable in the sample by examining the overlap of propensity score distributions between treated and control groups

3. **SUTVA:** No interference between units; treatment is well-defined
   - One person's treatment doesn't affect another's outcome
   - Requires careful consideration of the study context
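
Of the three assumptions, positivity is the one that lends itself to a direct numerical check. The following is a sketch of such a check; the helper name `check_overlap`, the `eps` threshold, and the toy scores are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np


def check_overlap(scores, t, eps=0.01):
    """Summarise common support and flag scores close to 0 or 1."""
    scores = np.asarray(scores, dtype=float)
    t = np.asarray(t)
    treated, control = scores[t == 1], scores[t == 0]

    # Common support: the range of scores covered by both groups.
    low = float(max(treated.min(), control.min()))
    high = float(min(treated.max(), control.max()))

    outside = (scores < low) | (scores > high)
    extreme = (scores < eps) | (scores > 1 - eps)
    return {
        "common_support": (low, high),
        "n_outside_support": int(outside.sum()),
        "n_extreme": int(extreme.sum()),
    }


# Toy scores: four controls followed by four treated units.
scores = np.array([0.05, 0.20, 0.40, 0.60, 0.30, 0.50, 0.80, 0.995])
t = np.array([0, 0, 0, 0, 1, 1, 1, 1])
report = check_overlap(scores, t)
print(report)
```

Units outside the common support have no comparable counterpart in the other group, and scores very close to 0 or 1 suggest near-deterministic treatment assignment for those covariate values.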

**The Balancing Property:**

A crucial insight is that the propensity score is a balancing score: X ⊥ T | e(X). Among individuals with the same propensity score, the distribution of covariates is the same for treated and control groups. Combined with unconfoundedness, this property lets matching on propensity scores remove confounding from the measured covariates.
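
The balancing property can be illustrated with simulated data where the true propensity score is known. This is a sketch under assumed settings (two standard normal covariates, a logistic assignment model, and a stratum of 0.45 to 0.55), not a result from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Two hypothetical covariates that both drive treatment assignment.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
e = 1 / (1 + np.exp(-(x1 + x2)))  # true propensity score
t = rng.binomial(1, e)


def mean_diff(x, t):
    """Difference in covariate means, treated minus control."""
    return x[t == 1].mean() - x[t == 0].mean()


# Across the whole sample, treated units have much higher x1 on average.
overall_gap = mean_diff(x1, t)

# Within a narrow band of propensity scores, the gap nearly vanishes.
stratum = (e > 0.45) & (e < 0.55)
stratum_gap = mean_diff(x1[stratum], t[stratum])

print(f"Gap in x1, full sample:      {overall_gap:.3f}")
print(f"Gap in x1, e(X) in .45-.55:  {stratum_gap:.3f}")
```

The covariate x1 still varies within the stratum; what changes is that its distribution no longer differs systematically between treated and control units.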

### 2.2 Connection to Course Themes

**GIGO (Garbage In, Garbage Out)**

The GIGO principle applies powerfully to PSM:

- **Garbage Confounders → Garbage Estimates:** If we fail to include important confounders in our propensity model, the resulting matches will not be truly comparable, and our causal estimates will be biased
- **Garbage Data Quality → Garbage Propensity Scores:** Missing data, measurement error, or coding mistakes in covariates propagate through to poor propensity estimates
- **No Statistical Fix for Missing Variables:** Unlike prediction tasks where we can sometimes compensate for missing features, causal inference fundamentally requires measuring the right confounders

The tutorial emphasizes this through:

- Explicit discussion of unmeasured confounding
- Sensitivity analysis showing how omitting a confounder biases results
- "Break the Method" exercises demonstrating failure modes

**Computational Skepticism**

PSM embodies computational skepticism by forcing us to question causal claims:

- **Question Assumptions:** The unconfoundedness assumption cannot be tested—we must reason about what might be missing
- **Verify Before Trusting:** Balance checks (SMD, Love plots) verify that matching actually worked
- **Understand Limitations:** Even well-executed PSM can't prove causation; it only removes bias from measured confounders

The tutorial cultivates skepticism through:

- Comparing naive estimates to PSM estimates to true effects
- Showing how easy it is to get wrong answers
- Critical thinking exercises about real-world studies

**Botspeak Framework**

The tutorial leverages AI collaboration principles:

- **Clear Communication:** Technical concepts are explained in plain language first
- **Structured Prompts:** Code cells are organized with clear inputs, processes, and outputs
- **Iterative Refinement:** The debugging guide helps students iterate toward correct solutions

### 2.3 Relationship to Real-World Data Science Workflows

**Where PSM Fits in the Data Science Pipeline:**

1. **Problem Formulation:** Define causal question, identify treatment and outcome
2. **Data Collection:** Gather observational data with relevant confounders
3. **Exploratory Analysis:** Visualize selection bias, check data quality
4. **PSM Analysis:** Estimate propensity scores, match, assess balance
5. **Effect Estimation:** Calculate ATT/ATE with uncertainty quantification
6. **Sensitivity Analysis:** Assess robustness to assumptions
7. **Communication:** Report findings with appropriate caveats

**Industry Applications:**

| Domain | Example Application |
|--------|---------------------|
| Healthcare | Evaluating treatment effectiveness from EHR data |
| Marketing | Measuring campaign ROI without A/B testing |
| Public Policy | Assessing job training program impact |
| Technology | Estimating feature impact from observational logs |
| Finance | Evaluating intervention effects on customer behavior |

**Why This Matters for Data Scientists:**

Many real-world questions are causal, but randomized experiments are often impossible due to:

- Ethical constraints (can't randomly deny beneficial treatment)
- Practical limitations (too expensive, too slow)
- Retrospective analysis needs (data already collected)

PSM provides a principled approach to extracting causal insights from observational data, a critical skill as organizations increasingly rely on data-driven decision making.

---

## 3. Implementation Analysis

### 3.1 Architecture and Design Decisions

**Modular Function Design**

The implementation follows software engineering best practices with modular, reusable functions:

```
estimate_propensity_scores()  →  Returns: scores, model, scaler
nearest_neighbor_matching()   →  Returns: matched_df, match_info
calculate_smd()               →  Returns: standardized mean difference
assess_balance()              →  Returns: balance DataFrame
estimate_att()                →  Returns: results dictionary
create_love_plot()            →  Returns: matplotlib figure
```

**Design Rationale:**

- **Single Responsibility:** Each function does one thing well
- **Clear Interfaces:** Explicit inputs and outputs with docstrings
- **Composability:** Functions can be chained together or used independently
- **Testability:** Isolated functions are easier to debug and verify

**Data Flow Architecture:**

```
Raw Data → Propensity Estimation → Matching → Balance Check → Effect Estimation
    ↓              ↓                   ↓            ↓               ↓
 DataFrame    PS Column Added    Matched DF    SMD Table      ATT + CI
```

**Key Implementation Choices:**

| Decision | Choice Made | Rationale |
|----------|-------------|-----------|
| PS Estimation | Logistic Regression | Standard, interpretable, well-understood |
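| Feature Scaling | StandardScaler | Improves solver convergence; puts covariates on a common scale, which matters because scikit-learn's LogisticRegression applies L2 regularization by default |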
| Matching Algorithm | Nearest Neighbor | Intuitive, widely used, good baseline |
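| Matching Direction | Treated → Control | Estimates ATT (the usual estimand when treated units are matched to controls) |
| Default Caliper | 0.2 × SD | Widely used convention, typically computed on the logit of the propensity score; balances match quality and quantity |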
| Replacement | Without (default) | Simpler variance estimation for teaching |
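
The choices in the table above can be combined into a short greedy matcher. This is a minimal sketch rather than the tutorial's implementation: the signature is hypothetical, and measuring distances on the logit of the propensity score is an assumption made here.

```python
import numpy as np


def nearest_neighbor_matching(scores, t, caliper_sd=0.2):
    """Greedy 1:1 matching of treated to control units, without replacement.

    Returns a list of (treated_index, control_index) pairs.
    """
    scores = np.asarray(scores, dtype=float)
    t = np.asarray(t)

    # Assumption: distances and the caliper are measured on the logit scale.
    logit = np.log(scores / (1 - scores))
    caliper = caliper_sd * logit.std(ddof=1)

    available = list(np.flatnonzero(t == 0))  # controls not yet used
    pairs = []
    for i in np.flatnonzero(t == 1):
        if not available:
            break  # no controls left; remaining treated units stay unmatched
        dists = np.abs(logit[available] - logit[i])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            # Remove the chosen control so it cannot be matched again.
            pairs.append((int(i), int(available.pop(j))))
    return pairs


# Four controls (indices 0-3) followed by three treated units (indices 4-6).
scores = np.array([0.30, 0.35, 0.70, 0.75, 0.31, 0.71, 0.95])
t = np.array([0, 0, 0, 0, 1, 1, 1])
pairs = nearest_neighbor_matching(scores, t)
print(pairs)  # → [(4, 0), (5, 2)]
```

The treated unit with score 0.95 has no control within the caliper, so it is left unmatched. Greedy matching also depends on the order in which treated units are processed, which is one reason a fixed ordering or random seed matters for reproducibility.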

### 3.2 Libraries and Tools

**Core Dependencies:**

| Library | Version | Purpose |
|---------|---------|---------|
| numpy | ≥1.20 | Numerical computations |
| pandas | ≥1.3 | Data manipulation |
| matplotlib | ≥3.4 | Static visualizations |
| seaborn | ≥0.11 | Statistical visualizations |
| scikit-learn | ≥0.24 | Logistic regression, scaling |
| scipy | ≥1.7 | Statistical tests |

**Why These Choices:**

- **Ubiquity:** All libraries are standard in data science curricula
- **Stability:** Mature, well-maintained packages with excellent documentation
- **Pedagogical Value:** Students likely already know these tools
- **No Black Boxes:** Using sklearn's LogisticRegression rather than specialized PSM packages keeps the implementation transparent

**Alternative Libraries (Mentioned for Reference):**

For production use, students are pointed to specialized packages:

- `pymatch`: Full PSM pipeline with additional features
- `causalinference`: Comprehensive causal inference toolkit
- `DoWhy`: Causal reasoning library that originated at Microsoft Research (now part of the PyWhy project)
- `EconML`: Machine learning for causal inference

### 3.3 Performance Considerations

**Computational Complexity:**

| Step | Complexity | Bottleneck |
|------|------------|------------|
| PS Estimation | O(n × k × iterations) | Logistic regression convergence |
| Nearest Neighbor Matching | O(n_treated × n_control) | Distance calculations |
| Balance Assessment | O(k × n) | SMD calculations per covariate |

**Scalability Discussion:**

The naive matching implementation has O(n²) complexity in the worst case. For the tutorial's 2,000 patients, this is instantaneous. For larger datasets:

- **10,000 patients:** ~1-2 seconds
- **100,000 patients:** May require optimization
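- **1,000,000+ patients:** Need an indexed or sorted search instead of all-pairs distances (KD-trees, Ball trees, or, because the propensity score is one-dimensional, sorting plus binary search)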

The tutorial acknowledges this limitation and points students to optimized libraries for production use.
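
Because the propensity score is a single number per unit, one way to avoid the all-pairs distance computation is to sort the control scores once and locate each treated score by binary search. The sketch below assumes matching with replacement and no caliper, which differs from the tutorial's default; the function name is hypothetical.

```python
import numpy as np


def nearest_control_sorted(treated_scores, control_scores):
    """For each treated score, return the index of the nearest control score.

    Matching is with replacement: a control can be chosen more than once.
    Cost is O((n_treated + n_control) * log(n_control)).
    """
    treated_scores = np.asarray(treated_scores, dtype=float)
    control_scores = np.asarray(control_scores, dtype=float)

    order = np.argsort(control_scores)
    sorted_c = control_scores[order]

    # Position where each treated score would be inserted among sorted controls.
    pos = np.searchsorted(sorted_c, treated_scores)
    left = np.clip(pos - 1, 0, len(sorted_c) - 1)
    right = np.clip(pos, 0, len(sorted_c) - 1)

    # Keep whichever neighbour (left or right) is closer.
    use_left = (np.abs(treated_scores - sorted_c[left])
                <= np.abs(sorted_c[right] - treated_scores))
    nearest_sorted = np.where(use_left, left, right)

    # Map positions in the sorted array back to original control indices.
    return order[nearest_sorted]


matches = nearest_control_sorted([0.32, 0.05, 0.95, 0.55], [0.9, 0.1, 0.5, 0.3])
print(matches)  # → [3 1 0 2]
```

Matching without replacement needs extra bookkeeping to retire used controls, so this sketch is best read as an illustration of why one-dimensional search scales well rather than as a drop-in replacement.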

**Memory Considerations:**

The implementation creates copies of dataframes for clarity. For very large datasets, in-place operations or chunked processing would be necessary.

### 3.4 Edge Cases and Limitations

**Handled Edge Cases:**

| Edge Case | How Addressed |
|-----------|---------------|
| No available controls | Skip treated unit, count as unmatched |
| Perfect separation in PS | Warning in diagnostic function |
| Zero variance covariate | Return SMD = 0 |
| Empty match result | Return empty DataFrame with info |
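
The zero-variance case in the table can be handled with a simple guard. The following is a sketch of an SMD helper using the common pooled-standard-deviation definition; the exact formula and signature used in the tutorial are assumed, not confirmed.

```python
import numpy as np


def calculate_smd(x_treated, x_control):
    """Standardized mean difference using the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)

    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    if pooled_sd == 0:
        # Neither group varies on this covariate, so the ratio is undefined;
        # following the edge-case table, report 0 instead of dividing by zero.
        return 0.0
    return (x_treated.mean() - x_control.mean()) / pooled_sd


print(calculate_smd([1, 2, 3], [2, 3, 4]))  # → -1.0
print(calculate_smd([5, 5, 5], [5, 5, 5]))  # → 0.0
```

Absolute SMD values are what the balance thresholds are usually applied to, so a caller would typically take `abs()` of the result before comparing it with a cutoff.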

**Known Limitations:**

1. **No Support for Continuous Treatments:** Implementation assumes binary treatment only

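2. **Simple Variance Estimation:** The SE calculation treats observations as independent; it does not account for the pairing induced by matching, for the fact that propensity scores are estimated, or for controls reused when matching with replacement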

3. **Single Matching Algorithm:** Only nearest neighbor implemented; production systems should offer alternatives (optimal matching, genetic matching)

4. **No Automatic Covariate Selection:** Users must specify confounders based on domain knowledge

5. **Limited Sensitivity Analysis:** Conceptual discussion of Rosenbaum bounds but no full implementation

**Pedagogical Note:** These limitations are intentional—a "perfect" implementation would obscure the core concepts. Students are encouraged to explore specialized packages for production work.
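
To make the second limitation concrete, here is a sketch of the kind of simple estimator it describes: a difference in means over the matched sample with a standard error that treats the two groups as independent. The function signature, dictionary keys, and toy outcomes are hypothetical, and the normal-approximation interval is an assumption.

```python
from statistics import NormalDist

import numpy as np


def estimate_att(y_treated, y_matched_control, alpha=0.05):
    """Difference in means on the matched sample with a simple SE.

    The SE treats the two groups as independent samples, which ignores the
    pairing created by matching and the uncertainty in the estimated scores.
    """
    y_t = np.asarray(y_treated, dtype=float)
    y_c = np.asarray(y_matched_control, dtype=float)

    att = y_t.mean() - y_c.mean()
    se = np.sqrt(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))

    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% interval
    return {
        "att": att,
        "se": se,
        "ci": (att - z * se, att + z * se),
        "n_treated": len(y_t),
    }


# Toy outcomes for four matched pairs.
results = estimate_att([10, 12, 14, 16], [8, 11, 12, 13])
print(results)
```

With only four pairs the interval is wide and includes zero, which illustrates why the point estimate alone should not be reported without its uncertainty.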

---

## 4. Assessment & Effectiveness

### 4.1 Validating Student Understanding

**Formative Assessment (During Learning):**

1. **Expected Outputs:** Each code section includes expected outputs so students can self-verify
   - Propensity score ranges
   - Number of matched pairs
   - SMD values before/after matching
   - ATT estimate magnitude

2. **Reflection Questions:** Embedded questions prompt metacognition
   - "Why is the naive estimate biased toward zero?"
   - "What would happen with unmeasured confounders?"

3. **Diagnostic Function:** `psm_diagnostic_report()` provides automated feedback on common issues

**Summative Assessment (After Learning):**

Three levels of exercises assess different cognitive depths:

| Level | Type | Assesses |
|-------|------|----------|
| Beginner | Conceptual questions | Understanding of selection bias, confounding |
| Intermediate | Implementation task | Ability to apply PSM to new data |
| Advanced | Critical analysis | Evaluation of real-world study limitations |

**Rubric for Exercise Evaluation:**

| Criterion | Excellent | Satisfactory | Needs Work |
|-----------|-----------|--------------|------------|
| Conceptual Understanding | Correctly identifies all confounders and bias direction | Identifies most confounders | Confuses correlation with causation |
| Implementation | Code runs correctly, matches expected output | Minor errors, mostly correct | Significant errors or incomplete |
| Critical Thinking | Identifies multiple limitations, proposes solutions | Identifies obvious limitations | Accepts results uncritically |

### 4.2 Common Challenges Students May Face

**Conceptual Challenges:**

| Challenge | How Tutorial Addresses It |
|-----------|---------------------------|
| Confusing prediction with causal inference | Explicit section on "Why Simple Comparison Fails" |
| Misunderstanding propensity score meaning | Multiple explanations: verbal, mathematical, visual |
| Assuming PSM "proves" causation | Repeated emphasis on assumptions and limitations |
| Difficulty with potential outcomes framework | Concrete examples with patient scenarios |

**Technical Challenges:**

| Challenge | How Tutorial Addresses It |
|-----------|---------------------------|
| Logistic regression convergence issues | StandardScaler applied; debugging guide covers this |
| Poor overlap/no matches | Visualization of PS distributions; diagnostic checks |
| Interpreting SMD values | Clear thresholds (0.1, 0.25) with visual guides |
| Understanding caliper selection | Explanation of 0.2 × SD convention; sensitivity analysis |

**Common Mistakes Addressed:**

The dedicated "Common Mistakes & Debugging Tips" section covers:

1. Including post-treatment variables
2. Propensity scores at 0 or 1
3. Ignoring balance checks
4. Wrong caliper scale
5. Forgetting to set random seed
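
The "wrong caliper scale" mistake arises because the same 0.2 × SD rule gives very different widths depending on whether the standard deviation is taken over raw propensity scores or over their logits. The helper below is a hypothetical sketch that shows the difference on toy scores.

```python
import numpy as np


def caliper_width(scores, multiplier=0.2, scale="logit"):
    """Caliper as a multiple of the SD of the scores on the chosen scale."""
    scores = np.asarray(scores, dtype=float)
    if scale == "logit":
        values = np.log(scores / (1 - scores))
    elif scale == "probability":
        values = scores
    else:
        raise ValueError("scale must be 'logit' or 'probability'")
    return multiplier * values.std(ddof=1)


scores = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prob_caliper = caliper_width(scores, scale="probability")
logit_caliper = caliper_width(scores, scale="logit")
print(f"Caliper on probability scale: {prob_caliper:.4f}")
print(f"Caliper on logit scale:       {logit_caliper:.4f}")
```

Whichever scale is chosen, distances between units must be computed on that same scale; comparing probability-scale distances against a logit-scale caliper would accept matches that are far too loose.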

### 4.3 Addressing Different Learning Styles

**Visual Learners:**

- Extensive use of diagrams (DAGs, flowcharts, ASCII art)
- Multiple visualization types (histograms, Love plots, bar charts)
- Color-coded outputs (✅ ⚠️ ❌)

**Reading/Writing Learners:**

- Comprehensive markdown documentation
- Detailed comments in code
- Summary tables throughout

**Auditory Learners:**

- Companion video follows Explain → Show → Try structure
- Verbal walkthrough of concepts before code

**Kinesthetic Learners:**

- Hands-on implementation from scratch
- "Try It Yourself" challenges
- Starter template with TODOs to complete

### 4.4 Future Improvements and Extensions

**Planned Enhancements:**

1. **Interactive Widgets:** Add ipywidgets for real-time parameter exploration
   - Slider for caliper adjustment
   - Dropdown for matching algorithm selection
   - Live-updating balance plots

2. **Additional Matching Methods:**
   - Optimal matching
   - Coarsened exact matching
   - Genetic matching

3. **Full Sensitivity Analysis:**
   - Implement Rosenbaum bounds
   - E-value calculations

4. **Real-World Datasets:**
   - Add examples with messy, realistic data
   - Include missing data handling

5. **Integration with Causal Graphs:**
   - DAG-based confounder selection
   - Connection to do-calculus

**Student-Suggested Extensions:**

Based on anticipated feedback, potential additions include:

- Video walkthroughs for each section
- Interactive quiz questions
- Comparison with other causal methods (IV, DiD, RDD)
- Case studies from published research

---

## References

1. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. *Biometrika, 70*(1), 41-55.

2. Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. *Multivariate Behavioral Research, 46*(3), 399-424.

3. Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. *Journal of Economic Surveys, 22*(1), 31-72.

4. King, G., & Nielsen, R. (2019). Why propensity scores should not be used for matching. *Political Analysis, 27*(4), 435-454.

5. Cunningham, S. (2021). *Causal Inference: The Mixtape*. Yale University Press.

6. Huntington-Klein, N. (2021). *The Effect: An Introduction to Research Design and Causality*. Chapman and Hall/CRC.

---

## Appendix: AI Assistance Acknowledgment

This pedagogical report and associated tutorial materials were developed with assistance from Claude (Anthropic) for:

- Code debugging and optimization
- Generating synthetic datasets
- Creating explanatory diagrams and flowcharts
- Proofreading and formatting
- Structuring pedagogical content

All pedagogical approaches, learning objectives, and educational design decisions represent original work by the author. The use of AI assistance enhanced productivity while maintaining academic integrity through transparent acknowledgment.